# Flux
Flux is a communication-overlapping library for dense/MoE models on GPUs, providing high-performance and pluggable kernels to support various parallelisms in model training/inference.

Flux's efficient kernels are compatible with PyTorch and can be easily integrated into existing frameworks, supporting various NVIDIA GPU architectures and data types.

Welcome to join the [WeChat](https://github.com/bytedance/flux/blob/main/docs/assets/comet_wechat_group.JPG) group and stay tuned!

## Why Flux

Flux can significantly reduce latency and increase throughput for tensor parallelism in both inference and training.

## Installation

### Install from Source

Here is a snippet to install Flux from source in a virtual environment with CUDA 12.4, torch 2.6.0, and Python 3.11.

Flux relies on NVSHMEM for communication across nodes. Therefore, if you need support for cross-machine tensor parallelism (TP), you must manually download the NVSHMEM source code and enable the nvshmem option during compilation.
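
The exact build commands are not reproduced in this snippet, so here is a minimal sketch assembled from the build options and dependencies described later in this README; the `--arch 90` value targets Hopper and is only an example.

```bash
# A sketch of a from-source build; adjust flags and versions to your setup.
git clone https://github.com/bytedance/flux.git
cd flux
git submodule update --init --recursive   # pulls the NCCL submodule

# For cross-node support, download NVSHMEM from https://developer.nvidia.com/nvshmem
# and place it under 3rdparty/nvshmem before building.

./build.sh --arch 90 --nvshmem --package  # use --arch 80 for Ampere GPUs
```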

Then you can expect a wheel package under the `dist/` folder that is suitable for your virtual environment.
### Install from PyPI

We also provide pre-built wheels for Flux, and you can install them directly with pip if the version you want is available. Currently we provide wheels for the following configurations: torch (2.4.0, 2.5.0, 2.6.0), Python (3.10, 3.11), and CUDA (12.4).
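
As a sketch of installing a pre-built wheel from PyPI (the package name below is an assumption not stated in this snippet; check the project page for the exact name and the supported torch/Python/CUDA combinations):

```bash
# Hypothetical PyPI install of a pre-built wheel; the package name is assumed.
pip install byte-flux
```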
### Customized Installation
#### Build options for source installation
1. Add `--nvshmem` to build Flux with NVSHMEM support. It is essential for the MoE kernels.
2. If you are tired of the cmake process, you can set the environment variable `FLUX_BUILD_SKIP_CMAKE` to 1 to skip cmake when `build/CMakeCache.txt` already exists.
3. If you want to build a wheel package, add `--package` to the build command (as in the examples below) and find the output wheel file under `dist/`.

```bash
# Ampere
./build.sh --arch 80 --package
# Hopper
./build.sh --arch 90 --package
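
# Optional flags (see the list above), for example on Hopper:
#   ./build.sh --arch 90 --nvshmem --package                 # NVSHMEM support for the MoE kernels
#   FLUX_BUILD_SKIP_CMAKE=1 ./build.sh --arch 90 --package   # reuse an existing build/CMakeCache.txt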
```
#### Dependencies

The core dependencies of Flux are NCCL, CUTLASS, and NVSHMEM, which are located under the `3rdparty` folder (a sketch of the expected layout follows this list).

1. NCCL: Managed automatically as a git submodule.
2. NVSHMEM: Downloaded from https://developer.nvidia.com/nvshmem. The current version is 3.2.5-1.
3. CUTLASS: Flux leverages CUTLASS to generate high-performance GEMM kernels. We currently use CUTLASS 3.7.0, and a small patch needs to be applied to it.
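
For reference, here is a sketch of what `3rdparty` is expected to contain before building; the exact subdirectory names are assumptions based on the notes above.

```bash
# NCCL arrives via the git submodule, NVSHMEM is downloaded manually,
# and CUTLASS carries the small patch mentioned above.
ls 3rdparty/
# cutlass  nccl  nvshmem
```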
## Quick Start

Once you have installed Flux successfully, you can run some basic demos, for example all-gather fused with grouped GEMM (the MoE MLP layer0).

We measured these examples on both A800s and H800s. Each machine has 8 GPUs, with the TP size set to 8. Comparing Flux against torch+NCCL, it can be observed that by overlapping fine-grained computation and communication, Flux is able to effectively hide a significant portion of the communication time; see the performance doc linked below for detailed numbers. In those comparisons, AG refers to AllGather and RS refers to ReduceScatter.

You can check out the documentation for more details!

* For more detailed usage of the MoE kernels, please refer to [Flux MoE Usage](https://github.com/bytedance/flux/blob/main/docs/moe_usage.md). Try some [examples](https://github.com/bytedance/flux/blob/main/examples) as a quick start. A [minimal MoE layer](https://github.com/bytedance/flux/blob/main/examples/moe_flux_only.py) can be implemented in only a few tens of lines of code using Flux; a sketch of launching it appears after this list.
* For some performance numbers, please refer to [Performance Doc](https://github.com/bytedance/flux/blob/main/docs/performance.md).
* To learn more about the design principles of Flux, please refer to [Design Doc](https://github.com/bytedance/flux/blob/main/docs/design.md).
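
As a minimal sketch of trying the MoE example mentioned above (the launcher and its arguments are assumptions; the script may expect a different entry point or extra flags):

```bash
# Hypothetical launch of the minimal MoE example on a single 8-GPU machine;
# the torchrun invocation and the argument-free call are assumptions.
torchrun --nproc_per_node=8 examples/moe_flux_only.py
```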
## Citations

If you use Flux in a scientific publication, we encourage you to add the following references to the related papers:
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{zhang2025comet,
title={Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts},