Fast and Fair Training for Deep Learning in Heterogeneous GPU Clusters

Abstract

This paper presents FFT, a novel scheduling system designed for Fast and Fair deep learning Training in heterogeneous GPU clusters. FFT incorporates two key designs. First, it employs a per-round resource allocation scheme that enables fine-grained control over resource utilization. Second, it seamlessly integrates a fairness compensation mechanism that evaluates fairness dynamically in real time. Building on these designs, FFT formulates a cost minimization problem to determine the optimal schedule, striking a balance between efficiency and fairness. Extensive experiments on physical clusters as well as a large-scale testbed demonstrate that FFT improves the overall JCT by up to 5.2x and job finish-time fairness by more than 2.2x compared to state-of-the-art heterogeneity-aware solutions.

Publication
In ACM International Conference on Supercomputing (ICS) 2025
Zizhao Mo
Huanle Xu

I am currently an assistant professor in the Department of Computer and Information Science, University of Macau.