This is the artifact for the MLSys'25 paper "Morphling: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling".
You can verify Morphling's functionality and reproduce the main results of the paper by following the instructions below in order.
We also describe the code organization of Morphling at the end of this document, so you can check its functionality by reviewing the code.
For AE Reviewers,
Morphling can be deployed in real GPU clusters. In our paper, we validated Morphling’s advantages on a GPU cluster equipped with 64 NVIDIA A800 GPUs. However, reproducing Table 4 in Section 7.3 requires access to our organization's internal 64-GPU cluster, and completing the full set of experiments is both costly and time-consuming (the sum of the `Makespan (h)` values in Table 4 indicates that it would take at least 9 days).
To facilitate rapid verification of Morphling’s core functionality, we collected the throughput values of the models under different resource amounts and execution plans in advance. We use these values to reproduce the performance model validation and the micro-benchmarks in the paper. We also provide a cluster simulator that emulates Morphling’s scheduling in the 64-GPU cluster; more details can be found in `simulator/README.md`. Importantly, the core components of the system, such as Morphling's performance model and scheduling algorithm, function in the simulator in the same way as they do in real GPU clusters.
Additionally, we provide the code and a description of its organization to help set up Morphling in a real GPU cluster. These details can be found in `sched/README.md`.
We have already set up the Docker container and pushed the image to a public repository. Reviewers only need to pull it:
docker pull zzxy180318/morphling-artifact:mlsys25ae
You can also build the image yourself from the Dockerfile:
docker build -t morphling:mlsys25ae .
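Finally, launch the container and open a shell in it (if you pulled the image instead of building it, use `zzxy180318/morphling-artifact:mlsys25ae` as the image name in `docker run`):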
docker run -tid --name morphling-artifact morphling:mlsys25ae
docker exec -it morphling-artifact /bin/bash
To verify that the environment has been built successfully and to check the basic functionality of the artifact, we provide a simple example that finishes in less than 1 minute.
cd artifact/0_getting_started
# Submit 10 workloads, schedule them with Morphling, and wait for them to complete.
./morphling_exp simulator/workloads/naive.csv morphling 60 8 8 96 1600 400 100 0 'naive'
During the execution of the script, Morphling continuously prints the active jobs, GPU utilization, completed jobs, average job completion time and makespan at each simulator interval. You should be able to see logs like:
...
---------------- SIMULATOR TIME: xxxx ----------------
Active jobs:
llama30-8: [restarts x] [placement (x,x,x,x)]
GPU utilization: xx
Completed jobs:
{'xx-0': xxxx, 'xx-1': xxxx, ...}
Average JCT (s): xxxxx
Makespan (s): xxxx
...
The final values of `Average JCT` and `Makespan` are the average job completion time and the makespan of the trace, respectively.
If the above command completes without hanging for a long time, the environment has been built correctly and you can proceed to the detailed instructions below.
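If you redirect the simulator output to a file, the final metrics can be pulled out programmatically. The following is a minimal sketch, assuming the log lines look exactly like the excerpt above; `final_metrics` is a hypothetical helper that is not part of the artifact, and the sample text is made up for illustration.

```python
import re

# Patterns follow the log excerpt above. The exact format of real logs may
# differ, so treat these regular expressions as assumptions.
JCT_RE = re.compile(r"Average JCT \(s\):\s*([0-9.]+)")
MAKESPAN_RE = re.compile(r"Makespan \(s\):\s*([0-9.]+)")


def final_metrics(log_text):
    """Return the last reported (average JCT, makespan) pair, in seconds."""
    jcts = JCT_RE.findall(log_text)
    makespans = MAKESPAN_RE.findall(log_text)
    if not jcts or not makespans:
        raise ValueError("no Average JCT / Makespan lines found")
    return float(jcts[-1]), float(makespans[-1])


# Made-up sample output, for illustration only.
sample = """\
---------------- SIMULATOR TIME: 60 ----------------
Average JCT (s): 0
Makespan (s): 60
---------------- SIMULATOR TIME: 120 ----------------
Average JCT (s): 95.5
Makespan (s): 120
"""

print(final_metrics(sample))
```

Only the last occurrence of each line is used, because the simulator prints the running values at every interval.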
Running all the experiments below once takes about 2 hours.
You can execute the *.ipynb files we provide for the experiments. Alternatively, you can directly view the *.md files, which are generated from the .ipynb files, to obtain the experimental results for Section 7.1 and Section 7.2.
- Run script to generate a new Notebook file that includes the output.
cd ./artifact/71_performance_model
jupyter nbconvert --execute validation.ipynb --to notebook
- Alternatively, you can run all cells by clicking `Cell > Run All` in the menu of Jupyter Notebook or JupyterLab.
We can get all the results in Table 2 of the paper through the steps above.
You can also run the validation experiment for each model individually to view the parameter fitting results and the performance prediction results. Each file contains two functions: `fit_XX()` for parameter fitting and `validate()` for performance prediction.
- Run script:
cd ./artifact/71_performance_model/model
# All the files formatted as validate_{model_name}.py can be executed in the following way
python validate_bert.py
python validate_gpt.py
...
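The fit-then-validate pattern used by these scripts can be illustrated with a toy example. This is only a sketch under stated assumptions: the model form `iter_time = a + b / gpus`, the helper names, and all numbers are hypothetical and do not reflect Morphling's actual performance model or the contents of the `validate_*.py` files.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over all points."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


# Hypothetical measurements (not from the paper): iteration time in seconds.
train_gpus, train_time = [1, 2, 4], [8.5, 4.5, 2.5]
test_gpus, test_time = [8], [1.6]

# Step 1: fit the toy model iter_time = a + b / gpus on the training points.
a, b = fit_linear([1 / g for g in train_gpus], train_time)

# Step 2: validate by predicting the held-out configuration.
predicted = [a + b / g for g in test_gpus]
error = mean_relative_error(predicted, test_time)
print(f"a={a:.3f} b={b:.3f} relative error={error:.4f}")
```

Validating on a configuration that was not used for fitting is what distinguishes prediction accuracy from goodness of fit.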
One of the key advantages of Morphling is its reconfigurability, which allows it to always choose the best execution plan under varying resource limits. This experiment evaluates Morphling's reconfigurability by continuously reducing the available resources.
- Run script to generate a new Notebook file that includes the output.
cd ./artifact/72_micro_benchmarks
jupyter nbconvert --execute Figure7.ipynb --to notebook
- Alternatively, you can run all cells by clicking `Cell > Run All` in Jupyter Notebook or JupyterLab.
We can get Figure 7 of the paper through the steps above.
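The idea of always choosing the best execution plan as resources shrink can be sketched as a lookup over pre-collected throughput values. The plan names, GPU counts, and throughput numbers below are made up for illustration, and `best_plan` is a hypothetical helper rather than Morphling's scheduler code.

```python
# Hypothetical throughput table (samples/s): plan -> {GPU count: throughput}.
# A missing GPU count means the plan is assumed infeasible at that size.
THROUGHPUT = {
    "plan_a": {8: 100.0, 4: 60.0},
    "plan_b": {8: 90.0, 4: 65.0, 2: 30.0},
    "plan_c": {2: 25.0, 1: 14.0},
}


def best_plan(table, gpus):
    """Return (plan, throughput) for the fastest feasible plan, or None."""
    feasible = {plan: tput[gpus] for plan, tput in table.items() if gpus in tput}
    if not feasible:
        return None
    plan = max(feasible, key=feasible.get)
    return plan, feasible[plan]


# Continuously reduce the available resources and re-select the plan.
for gpus in (8, 4, 2, 1):
    print(gpus, best_plan(THROUGHPUT, gpus))
```

In this toy table the best plan changes as the GPU count drops, which is the behavior a fixed-plan job cannot exploit.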
Another capability of Morphling is maximizing throughput across jobs while accounting for each job's resource sensitivity. In this experiment, we submit two jobs to a cluster of 4 A800 GPUs.
- Run script to generate a new Notebook file that includes the output.
cd ./artifact/72_micro_benchmarks
jupyter nbconvert --execute Figure8.ipynb --to notebook
- Alternatively, you can run all cells by clicking `Cell > Run All` in Jupyter Notebook or JupyterLab.
We can get Figure 8 of the paper through the steps above.
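To see why resource sensitivity matters when two jobs share 4 GPUs, consider a brute-force search over all splits. This is a sketch under stated assumptions: the speedup numbers are invented, and summing per-job speedups is a stand-in objective, not necessarily the one Morphling optimizes.

```python
# Hypothetical speedups relative to 1 GPU. These numbers are made up and are
# not measurements from the paper.
SPEEDUP = {
    "sensitive_job": {1: 1.0, 2: 1.9, 3: 2.7},
    "insensitive_job": {1: 1.0, 2: 1.2, 3: 1.3},
}
TOTAL_GPUS = 4


def best_split(speedup, total_gpus):
    """Try every split that gives each of the two jobs at least one GPU."""
    job_a, job_b = speedup  # the two job names, in insertion order
    best = None
    for gpus_a in range(1, total_gpus):
        gpus_b = total_gpus - gpus_a
        score = speedup[job_a][gpus_a] + speedup[job_b][gpus_b]
        if best is None or score > best[0]:
            best = (score, {job_a: gpus_a, job_b: gpus_b})
    return best


score, allocation = best_split(SPEEDUP, TOTAL_GPUS)
print(allocation, round(score, 2))
```

With these toy numbers the job that scales well receives three GPUs, because an even split would waste capacity on the job that barely benefits from it.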
Morphling keeps the global batch size unchanged during reconfiguration, ensuring training accuracy is not affected by design.
To verify that Morphling preserves training accuracy, we analyze the training loss over 3,000 mini-batches under different resource configurations and execution plans. The results are based on actual runs on GPUs. We have saved the loss logs in `./artifact/72_micro_benchmarks/loss_measurement`. Due to data sensitivity, only the LLaMA-2-7B model is included. The following script processes the training logs and generates the plots.
- Run script to generate a new Notebook file that includes the output.
cd ./artifact/72_micro_benchmarks
jupyter nbconvert --execute Figure9.ipynb --to notebook
- Alternatively, you can run all cells by clicking `Cell > Run All` in Jupyter Notebook or JupyterLab.
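The invariant behind this experiment, a global batch size that stays fixed across reconfigurations, can be illustrated with the standard relation global batch = micro-batch size × data-parallel degree × gradient accumulation steps. Whether Morphling adjusts accumulation steps in exactly this way is an assumption of this sketch; the function name and the numbers are hypothetical.

```python
def accumulation_steps(global_batch, micro_batch, dp_degree):
    """Gradient accumulation steps that keep the global batch size fixed."""
    per_step = micro_batch * dp_degree
    if global_batch % per_step != 0:
        raise ValueError("global batch is not divisible by micro_batch * dp_degree")
    return global_batch // per_step


GLOBAL_BATCH = 256  # hypothetical value, kept constant across reconfigurations
MICRO_BATCH = 4     # hypothetical per-replica micro-batch size

# Shrinking the data-parallel degree is compensated by more accumulation steps.
for dp in (8, 4, 2):
    steps = accumulation_steps(GLOBAL_BATCH, MICRO_BATCH, dp)
    assert MICRO_BATCH * dp * steps == GLOBAL_BATCH
    print(f"dp={dp} accumulation_steps={steps}")
```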
Our simulation experiments simulate a GPU cluster with 8 nodes, each containing 8 A800 GPUs, as described in Section 7.4 of the paper.
We selected three experiments from Section 7.3 to validate the full capabilities of Morphling (including the performance models and the scheduling algorithms). Specifically, we use the same `Base trace`, `Multi-tenant trace`, and `Best-plan trace` as in Table 4 to conduct end-to-end simulation experiments. For detailed descriptions of these three traces, please refer to Section 7.3.
cd artifact/73_cluster_experiment
# Submit 406 workloads, schedule them using Morphling, and wait for them to complete.
# Base Trace
./morphling_exp simulator/workloads/workload-base.csv morphling 60 8 8 96 1600 400 100 0 'Base'
# BP (best-plan) Trace
./morphling_exp simulator/workloads/workload-bp.csv morphling 60 8 8 96 1600 400 100 0 'BP'
# MT (multi-tenant) Trace
./morphling_exp simulator/workloads/workload-mt.csv morphling 60 8 8 96 1600 400 100 0 'MT'
During the experiment, all resource allocations (per-job resource amounts and placements) and job statuses (submission, queuing, running, completion) are saved in "artifact/73_cluster_experiment/{trace_name}_{date-time}.log". The final "Average JCT" and "Makespan" values in the logs are the scheduling results for that trace, corresponding to "Avg. JCT" and "Makespan" in Table 4. AE reviewers are welcome to inspect the logs. From these experiments, we can conclude that:
- The successful execution of the above experiments verifies that Morphling can effectively schedule a large number of jobs in the 64-GPU cluster.
- All three experiments validate Morphling's ability to reconfigure the execution plan together with resource scaling during scheduling, and the comparison with SOTA baselines in Table 4 demonstrates that Morphling can maximize cluster throughput.
- The experiment using the MT trace validates that Morphling can provide performance guarantees to guaranteed jobs.
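For readers unfamiliar with the two reported metrics, the sketch below computes them from per-job timestamps using the usual definitions: a job's completion time is its finish time minus its submission time, and the makespan is the time from the first submission to the last completion. The simulator may compute them slightly differently, and the job records here are made up.

```python
# Hypothetical job records: name -> (submission time, completion time), seconds.
JOBS = {
    "job-0": (0, 300),
    "job-1": (60, 240),
    "job-2": (120, 600),
}


def average_jct(jobs):
    """Mean of (completion - submission) over all jobs."""
    return sum(done - submitted for submitted, done in jobs.values()) / len(jobs)


def makespan(jobs):
    """Time from the earliest submission to the latest completion."""
    first_submit = min(submitted for submitted, _ in jobs.values())
    last_done = max(done for _, done in jobs.values())
    return last_done - first_submit


print("Average JCT (s):", average_jct(JOBS))
print("Makespan (s):", makespan(JOBS))
```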
The testbed experiments require 8 nodes, each with 8 NVIDIA A800 GPUs (80 GB), 96 vCPUs, 1,600 GB memory, 400 GB/s NVLink bandwidth, and 100 GB/s RDMA network bandwidth. The experiments in the paper also depend heavily on our internal testbed platform.
Please see `sched/README.md` for more details.
We list the code organization of the Morphling project to help AE reviewers quickly understand the role of each part.
- Morphling
  - artifact // Reproduce the main evaluation results of the paper. Feel free to execute the code following the instructions above.
  - benchmark // Implement the transformer models in Table 1 and manage the workloads in a real GPU cluster.
  - sched
    - models // The specification of the transformer models.
    - morphling-sched // Core functionalities of Morphling, including the scheduling algorithm and the performance model.
    - ... // Other files are mainly used to manage the workloads and resources in a real GPU cluster, such as implementing the scheduling decisions.
  - simulator
    - traces // The training throughput values collected in advance for each model in Table 1 with different resource amounts and execution plans.
    - .. // Other files are mainly used to manage the workloads and resources in the simulated GPU cluster. Note that the simulator invokes the classes and functions in `sched/morphling-sched` to use Morphling.
  - ...