
Commit 536dd85

release: v0.2.0 (#206)
* fix manifest.in
* remove tools dir
* __version__.py: 0.2.0
* update readme
* fix docs
* fix best-practice.md
* readme: nccl path
* readme: improve
* readme: improve
* add changelog.rst
* fix changelog
* readme: add pypi badge and news
* improve readme and changelog
1 parent af6fd58 commit 536dd85

11 files changed, +81 -207 lines changed

CHANGELOG.rst

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Changelog for BytePS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+0.2.0 (2020-02)
+------------------
+* Significantly improve RDMA performance by enforcing page-aligned memory.
+* Add IPC support for RDMA. Colocating servers and workers is now supported without sacrificing much performance.
+* Fix a hanging bug in the BytePS server.
+* Fix an RDMA-related segmentation fault during fork() (e.g., as used by the PyTorch data loader).
+* New feature: enable mixed use of colocated and non-colocated servers, along with a smart tensor allocation strategy.
+* New feature: add ``bpslaunch`` as the command to launch tasks.
+* Add support for pip install: ``pip3 install byteps``
+
+
+0.1.0 (2019-12)
+------------------
+* First official release.

MANIFEST.in

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
-include */*
+include */* LICENSE byteps.lds byteps.exp
+exclude .git/*
 recursive-include * *.cc *.h
 graft 3rdparty/ps-lite

README.md

Lines changed: 21 additions & 7 deletions
@@ -2,13 +2,18 @@
 
 [![Build Status](https://travis-ci.org/bytedance/byteps.svg?branch=master)](https://travis-ci.org/bytedance/byteps)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+![Pypi](https://img.shields.io/pypi/v/byteps.svg)
 
 BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on either TCP or RDMA network.
 
 BytePS outperforms existing open-sourced distributed training frameworks by a large margin. For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than [Horovod](https://github.com/horovod/horovod)+[NCCL](https://github.com/NVIDIA/nccl). In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.
 
 ## News
 
+- [BytePS-0.2.0](CHANGELOG.rst) has been released.
+- pip install is now available; see the [install tutorial](https://github.com/bytedance/byteps#quick-start).
+- [Significantly improved RDMA performance](https://github.com/bytedance/byteps/pull/184). Colocating servers and workers is now supported with high performance.
+- Fixed the [RDMA fork problem](https://github.com/bytedance/byteps/pull/192) caused by multiprocessing.
 - [New Server](https://github.com/bytedance/byteps/pull/151): We improve the server performance by a large margin, and it is now independent of MXNet KVStore. Try our [new docker images](docker/).
 - Use [the ssh launcher](launcher/) to launch your distributed jobs
 - [Improved key distribution strategy for better load-balancing](https://github.com/bytedance/byteps/pull/116)
@@ -41,21 +46,30 @@ BytePS also incorporates many acceleration techniques such as hierarchical strat
 
 We provide a [step-by-step tutorial](docs/step-by-step-tutorial.md) for you to run benchmark training tasks. The simplest way to start is to use our [docker images](docker). Refer to [Documentations](docs) for how to [launch distributed jobs](docs/running.md) and more [detailed configurations](docs/env.md). After you can start BytePS, read [best practice](docs/best-practice.md) to get the best performance.
 
-Below, we explain how to build and run BytePS by yourself. BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet. BytePS depends on CUDA and NCCL, and requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try `yum install devtoolset-7` before everything else.
+Below, we explain how to install BytePS by yourself. There are two options.
+
+### Install by pip
+
+```
+pip3 install byteps
+```
 
 ### Build from source code
 
-If the above does not contain your desired wheel resource, or you want to try building from source code:
+You can try out the latest features by installing directly from the master branch:
 
 ```
-git clone --recurse-submodules https://github.com/bytedance/byteps
+git clone --recursive https://github.com/bytedance/byteps
 cd byteps
-python setup.py install
+python3 setup.py install
 ```
 
-Notes:
-- For best compatibility, please pin your gcc to 4.9 before building, [here](https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.pytorch#L72-L80) is an example.
-- You may set `BYTEPS_USE_RDMA=1` to install with RDMA support. Before this, make sure your RDMA drivers have been properly installed and tested.
+Notes for the above two options:
+- BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
+- BytePS depends on CUDA and NCCL. You should specify the NCCL path with `export BYTEPS_NCCL_HOME=/path/to/nccl`. By default it points to `/usr/local/nccl`.
+- The installation requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try `yum install devtoolset-7` before everything else. In general, we recommend using gcc 4.9 for best compatibility (see [an example](https://github.com/bytedance/byteps/blob/3fba75def0d81c1d3225f8f397cc985200f57de7/docker/Dockerfile.mxnet#L72-L80) of pinning gcc).
+- RDMA support: during setup, the script automatically detects the RDMA header file. If you want to use RDMA, make sure your RDMA environment has been properly installed and tested before installing (see [an example](https://github.com/bytedance/byteps/blob/3fba75def0d81c1d3225f8f397cc985200f57de7/docker/Dockerfile.mxnet#L29-L33) for Ubuntu-18.04).
+
 
 ## Use BytePS in Your Code
 
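Putting the notes above together, a from-source install on a machine where NCCL is not under the default prefix might look roughly like the sketch below. The NCCL path is a placeholder, and whether you also need `devtoolset-7` depends on your distribution and gcc version.

```
# Sketch: build BytePS from source with a custom NCCL location (path is a placeholder).
export BYTEPS_NCCL_HOME=/path/to/nccl    # defaults to /usr/local/nccl if unset
git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install
```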
byteps/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-VERSION = (0, 1, 0)
+VERSION = (0, 2, 0)
 
 __version__ = '.'.join(map(str, VERSION))

docs/best-practice.md

Lines changed: 10 additions & 2 deletions
@@ -13,15 +13,23 @@ If you have NVLinks, leave `BYTEPS_PCIE_SWITCH_SIZE` unmodified. If you don't kn
 
 ## Multi-machine (distributed mode)
 
-This mode requires at least **4** physical machines, otherwise you won't see any benefits of BytePS. Two of the machines should have GPUs and run as workers. The other two run as servers and do not need GPUs. The scheduler can run on any machine.
+### With additional CPU servers
+
+This mode requires at least **4** physical machines. Two of the machines should have GPUs and run as workers. The other two run as CPU servers and do not need GPUs. The scheduler can run on any machine.
 
 The key here is to make sure the following:
 * Servers must be on different physical machines from workers.
 * The total bandwidth of the servers must be equal or larger than the total bandwidth of workers.
 
 If you are using RDMA, this should be sufficient. However, with TCP and >=25Gbps networks, it's possible that BytePS cannot fully utilize the bandwidth because a single TCP connection usually cannot run up to 25Gbps.
 
-To address this, you can try running more BytePS server instances on the server machines. For example, you can try running two server instances per server machines. This effectively doubles the number of TCP connections and should be sufficient for 25Gbps networks. For 40Gbps/50Gbps networks, you need three server instances per server machine, and so on. When doing this, you probably need to set `MXNET_OMP_MAX_THREADS` as: your CPU cores number divided by number of server instances per machine. For example, one machine has 32 cores and you put 4 server instances on it, then you need to `export MXNET_OMP_MAX_THREADS=8`. The idea is to reduce the CPU contention of different server instances.
+To address this, you can try running more BytePS server instances on the server machines. For example, running two server instances per server machine effectively doubles the number of TCP connections and should be sufficient for 25Gbps networks. For 40Gbps/50Gbps networks, you need three server instances per server machine, and so on.
+
+### No additional CPU servers
+
+When you don't have additional CPU servers, you should launch a worker process and a server process on each physical machine. We call this *co-locate* mode; its resource consumption is the same as Horovod's (no additional servers).
+
+If you are using TCP, you will probably get near-identical performance to Horovod-TCP. However, if you are using RDMA, you can set `BYTEPS_ENABLE_IPC=1` to enable IPC communication between the co-located worker and server, which eventually gives higher end-to-end performance than Horovod.
 
 ## The expected performance
 
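The co-locate description above does not include a concrete launch command. Following the environment-variable conventions shown in docs/running.md, one co-located machine might be set up roughly as sketched below; the IP, port, and process counts are placeholders, and applying `BYTEPS_ENABLE_IPC=1` to both processes is an assumption rather than something the docs state.

```
# Sketch only: co-locate mode on a single machine (all values are placeholders).
# Server process, running in the background:
BYTEPS_ENABLE_IPC=1 DMLC_ROLE=server \
DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=2 bpslaunch &

# Worker process on the same machine:
BYTEPS_ENABLE_IPC=1 DMLC_ROLE=worker DMLC_WORKER_ID=0 \
DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=2 bpslaunch YOUR_COMMAND
```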
docs/env.md

Lines changed: 4 additions & 4 deletions
@@ -107,16 +107,16 @@ export BYTEPS_NCCL_GROUP_SIZE=w
 ```
 
 Servers can also be the performance bottleneck, e.g., when there are only one server but multiple workers.
-You can try to increase the number of push threads on the servers (default is 1):
+You can try to increase the number of processing threads on the servers (default is 4):
 
 ```
-export SERVER_PUSH_NTHREADS=v
+export BYTEPS_SERVER_ENGINE_THREAD=v
 ```
 
-Increasing the number of engine CPU threads may also improves server performance:
+Or enable scheduling at the server side to prioritize tensors with higher priority:
 
 ```
-export MXNET_CPU_WORKER_NTHREADS=p
+export BYTEPS_SERVER_ENABLE_SCHEDULE=1
 ```
 
 ## Asynchronous training
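The two server-side knobs introduced in this diff can also be combined before launching a server. A minimal sketch, with a placeholder thread count:

```
# Sketch: server-side tuning using the options from docs/env.md (value is a placeholder).
export BYTEPS_SERVER_ENGINE_THREAD=8      # number of server processing threads (default is 4)
export BYTEPS_SERVER_ENABLE_SCHEDULE=1    # prioritize tensors with higher priority
```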

docs/running.md

Lines changed: 4 additions & 4 deletions
@@ -11,15 +11,15 @@ On worker 0, run:
 ```
 DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
 DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
-python launcher/launcher.py YOUR_COMMAND
+bpslaunch YOUR_COMMAND
 ```
 
 On worker 1, run (only DMLC_WORKER_ID is different from above):
 
 ```
 DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
 DMLC_WORKER_ID=1 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
-python launcher/launcher.py YOUR_COMMAND
+bpslaunch YOUR_COMMAND
 ```
 
 **For servers and schedulers, we highly recommend you use the docker image we build:**
@@ -32,14 +32,14 @@ Start server and scheduler docker instances with this image. In the server, run
 
 ```
 DMLC_ROLE=server DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
-DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
+DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 bpslaunch
 ```
 
 On the scheduler, run (we also remove DMLC_WORKER_ID, and set role to scheduler):
 
 ```
 DMLC_ROLE=scheduler DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
-DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
+DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 bpslaunch
 ```
 
 In this example, your scheduler must be able to bind to `10.0.0.1:9000`.

docs/step-by-step-tutorial.md

Lines changed: 15 additions & 37 deletions
@@ -23,9 +23,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1
 export DMLC_PS_ROOT_PORT=1234
 
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
---model ResNet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
 ```
 
 ### PyTorch
@@ -47,9 +45,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1
 export DMLC_PS_ROOT_PORT=1234
 
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
---model resnet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
 ```
 
 ### MXNet
@@ -70,9 +66,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1
 export DMLC_PS_ROOT_PORT=1234
 
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
---benchmark 1 --batch-size=32
+bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
 ```
 
 ## Distributed Training (TCP)
@@ -95,7 +89,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
 export DMLC_PS_ROOT_PORT=1234 # the scheduler port
 
-python3 /usr/local/byteps/launcher/launch.py
+bpslaunch
 ```
 
 For the server:
@@ -111,7 +105,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
 export DMLC_PS_ROOT_PORT=1234 # the scheduler port
 
-python3 /usr/local/byteps/launcher/launch.py
+bpslaunch
 ```
 
 
@@ -129,9 +123,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
 export DMLC_PS_ROOT_PORT=1234 # the scheduler port
 
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
---model ResNet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
 ```
 
 For worker-1:
@@ -149,26 +141,20 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
 export DMLC_PS_ROOT_PORT=1234 # the scheduler port
 
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
---model ResNet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
 ```
 
 
 If your workers use PyTorch, you need to change the image name to `bytepsimage/pytorch`, and replace the python script of the workers with
 
 ```
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
---model resnet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```
 
 
 If your workers use MXNet, you need to change the image name to `bytepsimage/mxnet`, and replace the python script of the workers with
 ```
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
---benchmark 1 --batch-size=32
+bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
 ```
 
 ## Distributed Training with RDMA
@@ -198,7 +184,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
 export DMLC_PS_ROOT_PORT=9000
 
 # launch the job
-python3 /usr/local/byteps/launcher/launch.py
+bpslaunch
 ```
 
 For the server:
@@ -222,7 +208,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
 export DMLC_PS_ROOT_PORT=9000
 
 # launch the job
-python3 /usr/local/byteps/launcher/launch.py
+bpslaunch
 ```
 
 For worker-0:
@@ -250,9 +236,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
 export DMLC_PS_ROOT_PORT=9000
 
 # launch the job
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
---model ResNet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
 ```
 
 For worker-1:
@@ -281,25 +265,19 @@ export DMLC_PS_ROOT_URI=10.0.0.100
 export DMLC_PS_ROOT_PORT=9000
 
 # launch the job
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
---model ResNet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
 ```
 
 
 
 If your workers use PyTorch, you need to change the image name to `bytepsimage/pytorch`, and replace the python script of the workers with
 
 ```
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
---model resnet50 --num-iters 1000000
+bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
 ```
 
 
 If your workers use MXNet, you need to change the image name to `bytepsimage/mxnet`, and replace the python script of the workers with
 ```
-python3 /usr/local/byteps/launcher/launch.py \
-python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
---benchmark 1 --batch-size=32
+bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
 ```

docs/troubleshooting.md

Lines changed: 6 additions & 6 deletions
@@ -11,7 +11,7 @@ When launching distributed jobs, if you see hanging at the beginning, one possib
 Install ps-lite:
 
 ```
-git clone --branch byteps https://github.com/bytedance/ps-lite.git
+git clone -b byteps https://github.com/bytedance/ps-lite.git
 cd ps-lite
 make -j
 ```
@@ -25,7 +25,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
 export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
 export DMLC_INTERFACE=eth0
-./ps-lite/tests/test_kv_app_benchmark
+./ps-lite/tests/test_benchmark
 ```
 
 For the server
@@ -36,7 +36,7 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
 export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
 export DMLC_INTERFACE=eth0
-./ps-lite/tests/test_kv_app_benchmark
+./ps-lite/tests/test_benchmark
 ```
 
 For the worker:
@@ -47,13 +47,13 @@ export DMLC_NUM_SERVER=1
 export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
 export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
 export DMLC_INTERFACE=eth0
-./ps-lite/tests/test_kv_app_benchmark 1024000 100 0
+./ps-lite/tests/test_benchmark 1024000 100 0
 ```
 
 If it succeed, you should be able to see something like this on the worker.
 ```
-tests/test_kv_app_benchmark.cc:77: push_byte=4096000, repeat=100, total_time=128.842ms
-tests/test_kv_app_benchmark.cc:91: pull_byte=4096000, repeat=100, total_time=353.38ms
+push_byte=4096000, repeat=100, total_time=128.842ms
+pull_byte=4096000, repeat=100, total_time=353.38ms
 ```
 
 (Note: for RDMA networks, use `make -j USE_RDMA=1` to build, and `export DMLC_ENABLE_RDMA=1` for running the scheduler / server / worker)
