Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters

April 2024

Abstract

Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such as computation and communication. This heterogeneity poses a significant challenge for the elastic scheduling of deep learning workloads. Unfortunately, existing elastic schedulers often overlook the impact of heterogeneity on scaling efficiency, resulting in considerably prolonged job completion times. In this paper, we present Heet, a new Heterogeneity-aware system explicitly developed for elastic training in DL clusters. Heet addresses two critical issues. First, it utilizes a 3-D collaborative filtering method to accurately measure the scaling efficiency of all elastic configurations on heterogeneous hosts, substantially reducing profiling overhead. Second, Heet introduces a unique price function to effectively balance scaling efficiency and scheduling efficiency. Building upon this function, Heet incorporates a scalable mechanism that employs minimum-weight full bipartite matching and opportunistic resource trading to generate dynamic scheduling decisions. Evaluations conducted on cloud clusters and large-scale simulations demonstrate that Heet can reduce job completion time by up to 2.46× compared to existing solutions.
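As a rough illustration of the matching step named in the abstract, the sketch below solves a tiny minimum-weight full bipartite matching by exhaustive search. It is not Heet's implementation: the function name `min_weight_full_matching`, the `prices` matrix, and its values are hypothetical, and exhaustive search is used only because it is easy to verify on a toy input. A real scheduler would need a polynomial-time solver.

```python
from itertools import permutations

def min_weight_full_matching(cost):
    """Exhaustively find the cheapest one-to-one assignment of rows to columns.

    cost[i][j] is a hypothetical price of placing job i on host group j.
    Requires len(cost) <= len(cost[0]); every row (job) is matched.
    """
    n_jobs, n_hosts = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Try every way of giving each job a distinct host group.
    for cols in permutations(range(n_hosts), n_jobs):
        total = sum(cost[i][j] for i, j in enumerate(cols))
        if total < best_total:
            best_total, best_assignment = total, cols
    return best_total, list(best_assignment)

# Hypothetical prices: rows are jobs, columns are host groups.
prices = [
    [4.0, 1.0, 3.0],
    [2.0, 0.5, 5.0],
    [3.0, 2.0, 2.0],
]
total, assignment = min_weight_full_matching(prices)
print(total, assignment)  # 5.0 [1, 0, 2]
```

Here job 0 goes to host group 1, job 1 to group 0, and job 2 to group 2. Note that job 1 does not get its individually cheapest group, because the matching minimizes the total price across all jobs.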

Type

Conference paper

Publication

In The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024

Zizhao Mo

2021 - Current

Huanle Xu

2021 - Current

Chengzhong Xu

2019 - Current