Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters

Abstract

Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such as computation and communication. This heterogeneity poses a significant challenge for the elastic scheduling of deep learning workloads. Unfortunately, existing elastic schedulers often overlook the impact of heterogeneity on scaling efficiency, resulting in considerably prolonged job completion times. In this paper, we present Heet, a new Heterogeneity-aware system explicitly developed for elastic training in DL clusters. Heet addresses two critical issues. First, it utilizes a 3-D collaborative filtering method to accurately measure the scaling efficiency of all elastic configurations on heterogeneous hosts, substantially reducing profiling overhead. Second, Heet introduces a unique price function to effectively balance scaling efficiency and scheduling efficiency. Building upon this function, Heet incorporates a scalable mechanism that employs minimum-weight full bipartite matching and opportunistic resource trading to generate dynamic scheduling decisions. Evaluations conducted on cloud clusters and large-scale simulations demonstrate that Heet can reduce job completion time by up to 2.46× compared to existing solutions.
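As a rough illustration of the minimum-weight full bipartite matching mentioned above, the sketch below assigns each job to exactly one host so that the summed price is minimal. The function name `min_weight_full_matching` and the `prices` matrix are hypothetical, invented for this example; the brute-force search is only an illustration for tiny inputs and is not Heet's actual price function or its scalable mechanism.

```python
from itertools import permutations

def min_weight_full_matching(cost):
    """Return (total, assignment) minimizing the summed cost, where
    assignment[j] is the host index given to job j.

    Brute force over all permutations, so only suitable for tiny
    illustrative inputs (assumes a square cost matrix)."""
    n = len(cost)
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[j][perm[j]] for j in range(n))
        if total < best_total:
            best_total, best_assign = total, perm
    return best_total, best_assign

# Hypothetical prices: rows are jobs, columns are hosts.
prices = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = min_weight_full_matching(prices)
print(total, assignment)
```

A production scheduler would likely replace the exhaustive search with a polynomial-time assignment algorithm, since the number of permutations grows factorially with cluster size.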

Publication
In The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024
Zizhao Mo
2021 - Current
Huanle Xu
2021.01 - Current

I am currently an assistant professor in the Department of Computer and Information Science, University of Macau.