A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems

Filippini F.;
2024

Abstract

In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which selects resources for the execution of each job so as to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Because computing an optimal solution is prohibitively expensive, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. The results of an extensive experimental evaluation show that STS achieves significantly better results than other methods in the literature, effectively avoiding due date violations and reducing the total cost by between 32% and 80% on average. We also demonstrate the applicability of our method in real-world scenarios: obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluate the effectiveness of GPU sharing, i.e., running multiple jobs on a single GPU. The results show that, depending on the workload and GPU memory, GPU sharing further reduces the energy cost by 17-29% on average.
Type: Journal article - Scientific article
Keywords: Average energy consumption minimization; deep learning; GPU cluster; GPU sharing; job tardiness; scheduling
Language: English
Date: 24 Nov 2023
Year: 2024
Volume: 12
Issue: 1
First page: 53
Last page: 69
Access: open
Filippini, F., Anselmi, J., Ardagna, D., Gaujal, B. (2024). A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE TRANSACTIONS ON CLOUD COMPUTING, 12(1), 53-69 [10.1109/TCC.2023.3336540].
Files in this item:
Filippini et al-2024-IEEE Transactions on Cloud Computing-AAM.pdf

Access: open access
Attachment type: Author’s Accepted Manuscript, AAM (Post-print)
License: Publisher-specific open access license
Size: 1.36 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/601070
Citations
  • Scopus 19
  • Web of Science (ISI) 15