Fast and Fair Training for Deep Learning in Heterogeneous GPU Clusters

Abstract

This paper presents FFT, a novel scheduling system designed for Fast and Fair deep learning Training in heterogeneous GPU clusters. FFT incorporates two key designs. First, it employs a per-round resource allocation scheme that enables fine-grained control over resource utilization. Second, it seamlessly integrates a fairness compensation mechanism that evaluates fairness dynamically in real time. Building on these designs, FFT formulates a cost minimization problem to determine the optimal schedule, striking a balance between efficiency and fairness. Extensive experiments on physical clusters as well as a large-scale testbed demonstrate that FFT improves the overall JCT by up to 5.2x and job finish-time fairness by more than 2.2x compared to state-of-the-art heterogeneity-aware solutions.

Publication
In ACM International Conference on Supercomputing (ICS) 2025
Zizhao Mo
Huanle Xu

I am currently an assistant professor in the Department of Computer and Information Science, University of Macau.