Artifact for Morphling (MLSys'25)

Overview

This is the artifact for the MLSys'25 paper "Morphling: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling".

You can verify Morphling's functionality and reproduce the main results of the paper by following the instructions below in order.

We also describe the code organization of Morphling at the end of this document, so you can check its functionality by reviewing the code.

For AE Reviewers,

Morphling can be deployed in real GPU clusters. In our paper, we validated Morphling’s advantages using a GPU cluster equipped with 64 NVIDIA A800 GPUs. However, reproducing Table 4 in Section 7.3 requires access to the internal 64-GPU cluster in our organization, and completing the full set of experiments is both costly and time-consuming (the sum of the Makespan (h) values in Table 4 indicates that it would take at least 9 days).

To facilitate rapid verification of Morphling’s core functionality, we collected in advance the throughput values of the models under different resource amounts and execution plans. We use these values to reproduce the performance model validation and the micro-benchmarks in the paper. We also provide a cluster simulator that emulates Morphling’s scheduling in the 64-GPU cluster. More details can be found in simulator/README.md. Note that in the simulator, the core components of the system, such as Morphling's performance model and scheduling algorithm, function in the same way as they do in real GPU clusters.

Additionally, we provide the code, together with a description of its organization, for setting up Morphling in a real GPU cluster. These details can be found in sched/README.md.

Artifact Setup

For AE Reviewers (Using Public Docker Containers)

We have already set up the Docker container and pushed it to a public repository. Reviewers only need to pull it:

docker pull zzxy180318/morphling-artifact:mlsys25ae

Prepare Environment By Yourself

Alternatively, you can build the image yourself from the provided Dockerfile:

docker build -t morphling:mlsys25ae .

Launch The Docker Images

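Finally, launch a container from the image and open a shell inside it. The commands below use the locally built image name morphling:mlsys25ae; if you pulled the public image instead, use zzxy180318/morphling-artifact:mlsys25ae as the image name: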

docker run -tid --name morphling-artifact morphling:mlsys25ae
docker exec -it morphling-artifact /bin/bash

Getting Started Instructions (a naive scheduling example)

To verify that the environment has been built successfully and to check the basic functionality of the artifact, we provide a naive example that finishes in less than 1 minute.

cd artifact/0_getting_started

# Submit 10 workloads, schedule them with Morphling, and wait for them to complete.
./morphling_exp simulator/workloads/naive.csv morphling 60 8 8 96 1600 400 100 0 'naive'

During the execution of the script, Morphling continuously prints the active jobs, GPU utilization, completed jobs, average job completion time and makespan at each simulator interval. You should be able to see logs like:

...
---------------- SIMULATOR TIME: xxxx ----------------
Active jobs:
    llama30-8:  [restarts x]    [placement (x,x,x,x)]
GPU utilization: xx
Completed jobs:
{'xx-0': xxxx, 'xx-1': xxxx, ...}
Average JCT (s): xxxxx
Makespan (s): xxxx
...

The final values of Average JCT and Makespan are the average job completion time and the makespan of the trace, respectively.

If the above command completes without hanging for a long time, the environment has been built correctly and you can proceed to the detailed instructions in the next section.
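As a convenience, the final values can be pulled out of captured output with a few lines of Python. This is a minimal sketch, not part of the artifact: the function name final_metrics is hypothetical, the patterns assume the line labels shown in the log excerpt above, and the sample numbers are made up for illustration.

```python
import re

# Patterns follow the labels in the log excerpt above; the exact
# number format printed by the simulator is an assumption.
_JCT = re.compile(r"^Average JCT \(s\):\s*([0-9.eE+-]+)")
_MAKESPAN = re.compile(r"^Makespan \(s\):\s*([0-9.eE+-]+)")


def final_metrics(log_text):
    """Return (average_jct_s, makespan_s) from the last report in the log.

    Either value is None if its label never appears.
    """
    jct = makespan = None
    for line in log_text.splitlines():
        line = line.strip()
        match = _JCT.match(line)
        if match:
            jct = float(match.group(1))
            continue
        match = _MAKESPAN.match(line)
        if match:
            makespan = float(match.group(1))
    return jct, makespan


# Made-up sample with two simulator intervals; only the last one counts.
sample = """\
---------------- SIMULATOR TIME: 60 ----------------
Average JCT (s): 0
Makespan (s): 60
---------------- SIMULATOR TIME: 120 ----------------
Average JCT (s): 95.5
Makespan (s): 120
"""

print(final_metrics(sample))
```

Later reports overwrite earlier ones, so the returned pair always reflects the last simulator interval that was printed.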

Detailed Instructions (to validate the functions and reproduce the evaluation results)

Running all the experiments below once takes about 2 hours.

You can execute the *.ipynb files we provide for the experiments. Alternatively, you can directly view the *.md files, which are generated from the .ipynb files, to see the experimental results for Section 7.1 and Section 7.2.

7.1 Performance Model Validation (Table 2)

Run the following commands to generate a new notebook file that includes the output.

cd ./artifact/71_performance_model
jupyter nbconvert --execute validation.ipynb --to notebook

Alternatively, you can directly run all cells by clicking Cell > Run All on the menu in Jupyter Notebook or JupyterLab.

The steps above reproduce all the results in Table 2 of the paper.

You can also run the validation experiment for each model individually to view the parameter fitting results and the performance prediction results. Each script contains two functions, fit_XX() for parameter fitting and validate() for performance prediction.

Run script:

cd ./artifact/71_performance_model/model
# All files named validate_{model_name}.py can be executed in the same way
python validate_bert.py
python validate_gpt.py
...
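The fit-then-validate pattern used by these scripts can be illustrated with a toy example. This is only a sketch under loud assumptions: the model below (iteration time = a + b / GPUs) is a hypothetical stand-in, not Morphling's performance model, the function names are invented for illustration, and the data points are made up. The real model and its parameters are defined in the paper and in the validate_{model_name}.py scripts.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_toy_model(gpus, iter_times):
    """Fit the hypothetical model iter_time = a + b / gpus."""
    return fit_linear([1.0 / g for g in gpus], iter_times)


def predict(params, gpus):
    """Predicted iteration time for a given GPU count."""
    a, b = params
    return a + b / gpus


def mean_relative_error(params, gpus, measured):
    """Average of |predicted - measured| / measured over held-out points."""
    errors = [abs(predict(params, g) - m) / m for g, m in zip(gpus, measured)]
    return sum(errors) / len(errors)


# Made-up measurements: fit on small GPU counts, validate on a larger one.
train_gpus = [1, 2, 4, 8]
train_times = [2.1, 1.1, 0.6, 0.35]
params = fit_toy_model(train_gpus, train_times)

print("fitted (a, b):", params)
print("relative error on held-out point:", mean_relative_error(params, [16], [0.25]))
```

Fitting on a few profiled configurations and then measuring the error on configurations that were held out is the general idea; the specific metric reported in Table 2 is defined in the paper.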

7.2 Micro-benchmarks: Adapting to changing resource limits (Figure 7)

One of the key advantages of Morphling is its reconfigurability, which allows it to always choose the best execution plan under varying resource limits. This experiment evaluates Morphling's reconfigurability by continuously reducing the available resources.

Run the following commands to generate a new notebook file that includes the output.

cd ./artifact/72_micro_benchmarks
jupyter nbconvert --execute Figure7.ipynb --to notebook

Alternatively, you can directly run all cells by clicking Cell > Run All in Jupyter Notebook or JupyterLab.

The steps above reproduce Figure 7 of the paper.

7.2 Micro-benchmarks: Maximizing throughput across jobs (Figure 8)

Another function of Morphling is to maximize throughput across jobs by taking each job's resource sensitivity into account. In this experiment, we submit two jobs to a cluster of 4 A800 GPUs.

Run the following commands to generate a new notebook file that includes the output.

cd ./artifact/72_micro_benchmarks
jupyter nbconvert --execute Figure8.ipynb --to notebook

Alternatively, you can directly run all cells by clicking Cell > Run All in Jupyter Notebook or JupyterLab.

The steps above reproduce Figure 8 of the paper.

7.2 Micro-benchmarks: Accuracy during reconfiguration (Figure 9)

Morphling keeps the global batch size unchanged during reconfiguration, so by design training accuracy is not affected.

To verify that Morphling preserves training accuracy, we analyze the training loss across 3000 mini-batches under different resource configurations and execution plans. The experiment results are based on actual runs on GPUs. We have saved the loss logs in ./artifact/72_micro_benchmarks/loss_measurement. Due to data sensitivity, the logs cover only the LLaMA-2-7B model. The following script processes the training logs and generates the plots.

Run the following commands to generate a new notebook file that includes the output.
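This README does not describe how the global batch size is held constant. One common approach in data-parallel training is to adjust the number of gradient accumulation steps when the data-parallel degree changes. The sketch below illustrates that arithmetic only; the function name and the numbers are assumptions for illustration and are not taken from Morphling's code.

```python
def accumulation_steps(global_batch, micro_batch, dp_degree):
    """Gradient accumulation steps needed so that
    micro_batch * dp_degree * steps == global_batch.

    Generic data-parallel arithmetic, not Morphling's implementation.
    """
    samples_per_step = micro_batch * dp_degree
    if global_batch % samples_per_step != 0:
        raise ValueError(
            "global batch %d is not divisible by micro_batch * dp_degree = %d"
            % (global_batch, samples_per_step)
        )
    return global_batch // samples_per_step


# Hypothetical job: global batch 256, micro-batch 4 per replica.
# Shrinking the data-parallel degree raises the accumulation steps,
# while the global batch size stays at 256.
for dp in (8, 4, 2):
    steps = accumulation_steps(256, 4, dp)
    print("dp=%d -> accumulation steps=%d, global batch=%d" % (dp, steps, 4 * dp * steps))
```

Because the product stays fixed, each optimizer update sees the same number of samples regardless of how many replicas are running.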

cd ./artifact/72_micro_benchmarks
jupyter nbconvert --execute Figure9.ipynb --to notebook

Alternatively, you can directly run all cells by clicking Cell > Run All in Jupyter Notebook or JupyterLab.
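To run the three micro-benchmark notebooks in one go, the nbconvert command above can be wrapped in a small loop. This is a convenience sketch rather than part of the artifact: the helper names are hypothetical, and it assumes it is started from ./artifact/72_micro_benchmarks with jupyter available on the PATH. By default it only prints the commands.

```python
import subprocess

# Notebook names as they appear in the instructions above.
NOTEBOOKS = ["Figure7.ipynb", "Figure8.ipynb", "Figure9.ipynb"]


def nbconvert_command(notebook):
    """Build the same command line used in the instructions above."""
    return ["jupyter", "nbconvert", "--execute", notebook, "--to", "notebook"]


def run_all(notebooks=NOTEBOOKS, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for notebook in notebooks:
        cmd = nbconvert_command(notebook)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first notebook that fails to execute.
            subprocess.run(cmd, check=True)


run_all()
```

Call run_all(dry_run=False) to actually execute the notebooks.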

7.3 Simulation: Cluster Experiments

Our simulation experiments emulate a GPU cluster with 8 nodes, each containing 8 A800 GPUs, as described in Section 7.4 of the paper.

We selected three experiments from Section 7.3 to validate the full capabilities of Morphling (including the performance models and the scheduling algorithms). Specifically, we use the same Base trace, Multi-tenant trace, and Best-plan trace as in Table 4 to conduct end-to-end simulation experiments. For detailed descriptions of these three traces, please refer to Section 7.3.

cd artifact/73_cluster_experiment

# Submit 406 workloads, schedule them using Morphling, and wait for them to complete.

# Base Trace
./morphling_exp simulator/workloads/workload-base.csv morphling 60 8 8 96 1600 400 100 0 'Base'

# BP (best-plan) Trace
./morphling_exp simulator/workloads/workload-bp.csv morphling 60 8 8 96 1600 400 100 0 'BP'

# MT (multi-tenant) Trace
./morphling_exp simulator/workloads/workload-mt.csv morphling 60 8 8 96 1600 400 100 0 'MT'

During the experiment, all resource allocations (per-job resource amounts and placements) and job statuses (submission, queuing, running, completion) are saved in "artifact/73_cluster_experiment/{trace_name}_{date-time}.log". The final "Average JCT" and "Makespan" values in the logs are the scheduling results for that trace and correspond to "Avg. JCT" and "Makespan" in Table 4. AE reviewers are welcome to inspect the logs. The experiments support the following conclusions:

- The successful execution of the above experiments verifies that Morphling can effectively schedule a large number of jobs in the 64-GPU cluster.
- All three experiments validate Morphling's ability to reconfigure the execution plan together with resource scaling during scheduling, and the comparison with the SOTA baselines in Table 4 demonstrates that Morphling can maximize cluster throughput.
- The experiment using the MT trace validates that Morphling can provide performance guarantees to guaranteed jobs.
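When comparing against Table 4, note that the logs report the makespan in seconds while the table lists Makespan (h). The sketch below collects the newest log per trace and converts the makespan to hours. It is an illustrative helper, not part of the artifact: the function names are hypothetical, the file pattern is inferred from the {trace_name}_{date-time}.log naming above, and the average JCT is left in seconds because this README does not state the unit used in the table.

```python
import glob
import os


def latest_log(log_dir, trace_name):
    """Most recently modified '<trace_name>_*.log' in log_dir, or None."""
    paths = glob.glob(os.path.join(log_dir, trace_name + "_*.log"))
    return max(paths, key=os.path.getmtime) if paths else None


def last_value(path, label):
    """Last numeric value on a line that starts with `label`, or None."""
    value = None
    with open(path) as log_file:
        for line in log_file:
            line = line.strip()
            if line.startswith(label):
                value = float(line[len(label):].strip())
    return value


def summarize(log_dir, traces=("Base", "BP", "MT")):
    """Map each trace with a complete log to its final metrics."""
    rows = {}
    for trace in traces:
        path = latest_log(log_dir, trace)
        if path is None:
            continue
        jct = last_value(path, "Average JCT (s):")
        makespan = last_value(path, "Makespan (s):")
        if jct is None or makespan is None:
            continue
        rows[trace] = {"avg_jct_s": jct, "makespan_h": makespan / 3600.0}
    return rows


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for trace, row in summarize("artifact/73_cluster_experiment").items():
        print(trace, row)
```

Traces without a matching log, or whose log has not yet printed both metrics, are skipped rather than reported with partial values.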

Testbed Setup

The testbed experiments require 8 nodes, each with 8 NVIDIA A800 GPUs (80 GB), 96 vCPUs, 1,600 GB memory, 400 GB/s NVLink bandwidth, and 100 GB/s RDMA network bandwidth. The experiments in the paper also depend heavily on our internal testbed platform.

Please see sched/README.md for more details.

Code Organization (to check functionality)

We list the code organization of the Morphling project to help AE reviewers quickly understand the role of each part of the project.

- Morphling
  - artifact // Reproduces the main evaluation results of the paper. Feel free to execute the code following the instructions above.
  - benchmark // Implements the transformer models in Table 1 and manages the workloads in a real GPU cluster.
  - sched
    - models // The specifications of the transformer models.
    - morphling-sched // Core functionality of Morphling, including the scheduling algorithm and the performance model.
    - ... // Other files mainly manage the workloads and resources in a real GPU cluster, for example by carrying out the scheduling decisions.
  - simulator
    - traces // The training throughput values collected in advance for each model in Table 1 under different resource amounts and execution plans.
    - ... // Other files mainly manage the workloads and resources in the simulated GPU cluster. Note that the simulator invokes the classes and functions in `sched/morphling-sched` to use Morphling.
  - ...
