
Commit ffd9310

Added HOROVOD_GPU_OPERATIONS installation variable (horovod#1960)
Signed-off-by: Travis Addair <[email protected]>
1 parent 2dc0553 commit ffd9310

20 files changed (+108, -84 lines)
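
In effect, this commit consolidates the three per-operation installation variables (`HOROVOD_GPU_ALLREDUCE`, `HOROVOD_GPU_ALLGATHER`, `HOROVOD_GPU_BROADCAST`) into a single `HOROVOD_GPU_OPERATIONS` variable. A minimal before/after sketch of the install command, drawn from the diffs below:

    # before this commit: one variable per GPU operation
    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install horovod

    # after this commit: one variable covering allreduce, allgather, and broadcast
    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod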

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -10,6 +10,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 - Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. ([#1849](https://github.com/horovod/horovod/pull/1849))
 
+- Added NCCL implementation of the allgather operation. ([#1952](https://github.com/horovod/horovod/pull/1952))
+
+- Added `HOROVOD_GPU_OPERATIONS` installation variable to simplify enabling NCCL support for all GPU operations. ([#1960](https://github.com/horovod/horovod/pull/1960))
+
 ### Changed
 
 ### Deprecated

Dockerfile.gpu

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ RUN mkdir /tmp/openmpi && \
 
 # Install Horovod, temporarily using CUDA stubs
 RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
-    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
+    HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
     pip install --no-cache-dir horovod && \
     ldconfig
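
For a quick local check of the updated image, a standard build command works; the tag here is illustrative:

    $ docker build -t horovod:gpu -f Dockerfile.gpu .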

Dockerfile.test.gpu

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ ARG PYTORCH_PACKAGE=torch==1.2.0
 ARG TORCHVISION_PACKAGE=torchvision==0.4.0
 ARG MXNET_PACKAGE=mxnet-cu100==1.5.0
 ARG PYSPARK_PACKAGE=pyspark==2.4.0
-ARG HOROVOD_BUILD_FLAGS="HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL"
+ARG HOROVOD_BUILD_FLAGS="HOROVOD_GPU_OPERATIONS=NCCL"
 ARG HOROVOD_MIXED_INSTALL=0
 
 # Set default shell to /bin/bash
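
Because `HOROVOD_BUILD_FLAGS` is a Dockerfile `ARG`, the new default can still be overridden per build without editing the file. A sketch (the image tag and the MPI value are illustrative; MPI is listed as an accepted value in the docs/install.rst diff below):

    $ docker build -f Dockerfile.test.gpu \
        --build-arg HOROVOD_BUILD_FLAGS="HOROVOD_GPU_OPERATIONS=MPI" \
        -t horovod-test:gpu .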

README.rst

Lines changed: 7 additions & 2 deletions
@@ -52,8 +52,13 @@ about who's involved and how Horovod plays a role, read the LF AI `announcement
 
 .. contents::
 
+|
+
+Documentation
+-------------
 
-The full documentation and an API reference are published at https://horovod.readthedocs.io/en/latest.
+- `Latest Release <https://horovod.readthedocs.io/en/stable>`_
+- `master <https://horovod.readthedocs.io/en/latest>`_
 
 |
 
@@ -118,7 +123,7 @@ To install Horovod:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
 
 This basic installation is good for laptops and for getting to know Horovod.
 
docs/adasum_user_guide.rst

Lines changed: 3 additions & 3 deletions
@@ -77,7 +77,7 @@ Below are the requirements for running Horovod with AdaSum:
 
 *Using NCCL:*
 
-If the **HOROVOD_GPU_ALLREDUCE=NCCL** flag is used to compile Horovod, NCCL is used instead. In this case, NCCL will be used for intra-node communication, and AdaSum will be used for inter-node communication.
+If the **HOROVOD_GPU_OPERATIONS=NCCL** flag is used to compile Horovod, NCCL is used instead. In this case, NCCL will be used for intra-node communication, and AdaSum will be used for inter-node communication.
 
 Modes of Operation
 =====================
@@ -207,10 +207,10 @@ Key Takeaways
 
 - As the number of ranks scales up, the learning rate does not need to be scaled linearly if using CPU to do AdaSum reduction. A good scaling factor would be between 2\-2.5 over the best learning rate for a single worker.
 
-- If the HOROVOD_GPU_ALLREDUCE=NCCL flag is used to compile Horovod, the learning rate that should be used is equal to the best learning rate for a single worker (GPU) scaled by the number of GPUs locally on a node. On very large clusters, scaling this even more by another factor of 1.5\-2.0x might give better results but is not guaranteed and should be tried only if scaling by just the local size is not sufficient for good convergence.
+- If the HOROVOD_GPU_OPERATIONS=NCCL flag is used to compile Horovod, the learning rate that should be used is equal to the best learning rate for a single worker (GPU) scaled by the number of GPUs locally on a node. On very large clusters, scaling this even more by another factor of 1.5\-2.0x might give better results but is not guaranteed and should be tried only if scaling by just the local size is not sufficient for good convergence.
 
 - Pytorch training in fp16 format is not yet supported. Integration of Apex into the new optimizer to enable full mixed precision training with AdaSum in Pytorch is a work in progress.
 
-- When HOROVOD_GPU_ALLREDUCE=NCCL flag is used to compile Horovod and training is run on a single node, only averaging through NCCL library is used to perform reductions and no AdaSum algorithm will take place in this configuration.
+- When HOROVOD_GPU_OPERATIONS=NCCL flag is used to compile Horovod and training is run on a single node, only averaging through NCCL library is used to perform reductions and no AdaSum algorithm will take place in this configuration.
 
 .. inclusion-marker-end-do-not-remove

docs/gpus.rst

Lines changed: 4 additions & 4 deletions
@@ -45,20 +45,20 @@ by installing an `nv_peer_memory <https://github.com/Mellanox/nv_peer_memory>`__
 
 .. code-block:: bash
 
-    $ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install --no-cache-dir horovod
+    $ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
 
 If you installed NCCL 2 using the Ubuntu package, you can run:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
 If you installed NCCL 2 using the `CentOS / RHEL package <https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html#rhel_centos>`__, you can run:
 
 .. code-block:: bash
 
-    $ HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib64 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install --no-cache-dir horovod
+    $ HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib64 HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
 
 **Note**: Some models with a high computation to communication ratio benefit from doing allreduce on CPU, even if a
@@ -87,7 +87,7 @@ configure Horovod to use them as well:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=MPI pip install --no-cache-dir horovod
 
 
 **Note**: Allgather allocates an output tensor which is proportionate to the number of processes participating in the
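
After any of these installs, one way to confirm which framework each GPU operation was actually built with is Horovod's build check; this assumes a Horovod version whose `horovodrun` supports the flag:

    $ horovodrun --check-build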

docs/install.rst

Lines changed: 5 additions & 4 deletions
@@ -152,8 +152,8 @@ will be used for CPU operations. You can override this by setting ``HOROVOD_CPU_OPERATIONS``.
 NCCL
 ~~~~
 
-NCCL is currently supported for Allreduce and Broadcast operations. You can enable these by setting
-``HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL`` during installation.
+NCCL is supported for Allreduce, Allgather, and Broadcast operations. You can enable these by setting
+``HOROVOD_GPU_OPERATIONS=NCCL`` during installation.
 
 NCCL operations are supported on both Nvidia (CUDA) and AMD (ROCm) GPUs. You can set ``HOROVOD_GPU`` in your
 environment to specify building with CUDA or ROCm. CUDA will be assumed if not specified.
@@ -223,8 +223,9 @@ Possible values are given in curly brackets: {}.
 * ``HOROVOD_WITH_MPI`` - {1}. Require that Horovod is built with MPI support enabled.
 * ``HOROVOD_WITHOUT_MPI`` - {1}. Skip building with MPI support.
 * ``HOROVOD_GPU`` - {CUDA, ROCM}. Framework to use for GPU operations.
-* ``HOROVOD_GPU_ALLREDUCE`` - {NCCL, MPI, DDL}. Framework to use for GPU tensor allreduce.
-* ``HOROVOD_GPU_ALLGATHER`` - {MPI}. Framework to use for GPU tensor allgather.
+* ``HOROVOD_GPU_OPERATIONS`` - {NCCL, MPI}. Framework to use for GPU tensor allreduce, allgather, and broadcast.
+* ``HOROVOD_GPU_ALLREDUCE`` - {NCCL, MPI}. Framework to use for GPU tensor allreduce.
+* ``HOROVOD_GPU_ALLGATHER`` - {NCCL, MPI}. Framework to use for GPU tensor allgather.
 * ``HOROVOD_GPU_BROADCAST`` - {NCCL, MPI}. Framework to use for GPU tensor broadcast.
 * ``HOROVOD_ALLOW_MIXED_GPU_IMPL`` - {1}. Allow Horovod to install with NCCL allreduce and MPI GPU allgather / broadcast. Not recommended due to a possible deadlock.
 * ``HOROVOD_CPU_OPERATIONS`` - {MPI, GLOO, CCL}. Framework to use for CPU tensor allreduce, allgather, and broadcast.
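
The per-operation variables remain available alongside the new umbrella variable, so a mixed build is still expressible. A sketch of the NCCL-allreduce / MPI-allgather-broadcast combination named under ``HOROVOD_ALLOW_MIXED_GPU_IMPL``; that the per-operation variables take precedence over ``HOROVOD_GPU_OPERATIONS`` is an assumption here, not something this diff states:

    # assumption: per-operation variables override HOROVOD_GPU_OPERATIONS
    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI \
        HOROVOD_ALLOW_MIXED_GPU_IMPL=1 pip install --no-cache-dir horovod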

docs/summary.rst

Lines changed: 1 addition & 4 deletions
@@ -52,9 +52,6 @@ about who's involved and how Horovod plays a role, read the LF AI `announcement
 
 .. contents::
 
-
-The full documentation and an API reference are published at https://horovod.readthedocs.io/en/latest.
-
 |
 
 Why Horovod?
@@ -118,7 +115,7 @@ To install Horovod:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
 
 This basic installation is good for laptops and for getting to know Horovod.
 

docs/summary.rst.patch

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+57,64d56
+< Documentation
+< -------------
+<
+< - `Latest Release <https://horovod.readthedocs.io/en/stable>`_
+< - `master <https://horovod.readthedocs.io/en/latest>`_
+<
+< |
+<

docs/troubleshooting.rst

Lines changed: 8 additions & 8 deletions
@@ -56,7 +56,7 @@ To use CUDA stub drivers:
     $ ldconfig /usr/local/cuda/lib64/stubs
 
     # install Horovod, add other HOROVOD_* environment variables as necessary
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
     # revert to standard libraries
     $ ldconfig
@@ -90,7 +90,7 @@ To use custom MPI directory:
 .. code-block:: bash
 
     $ export PATH=$PATH:/path/to/mpi/bin
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 2. Are MPI libraries added to ``$LD_LIBRARY_PATH`` or ``ld.so.conf``?
@@ -202,14 +202,14 @@ For example:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 Or:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_INCLUDE=/path/to/nccl/include HOROVOD_NCCL_LIB=/path/to/nccl/lib pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_INCLUDE=/path/to/nccl/include HOROVOD_NCCL_LIB=/path/to/nccl/lib pip install --no-cache-dir horovod
 
 
 Pip install: no such option: --no-cache-dir
@@ -237,7 +237,7 @@ For example:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 ncclAllReduce failed: invalid data type
@@ -260,7 +260,7 @@ the package and reinstall Horovod:
 
     $ conda remove nccl
    $ pip uninstall -y horovod
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error
@@ -322,15 +322,15 @@ To build Horovod with a specific CUDA version, use the ``HOROVOD_CUDA_HOME`` environment variable:
 .. code-block:: bash
 
     $ pip uninstall -y horovod
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_HOME=/path/to/cuda pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_HOME=/path/to/cuda pip install --no-cache-dir horovod
 
 
 Alternatively, you can use the ``HOROVOD_CUDA_INCLUDE`` and ``HOROVOD_CUDA_LIB`` environment variables to specify the CUDA library to use:
 
 .. code-block:: bash
 
     $ pip uninstall -y horovod
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_INCLUDE=/path/to/cuda/include HOROVOD_CUDA_LIB=/path/to/cuda/lib64 pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_INCLUDE=/path/to/cuda/include HOROVOD_CUDA_LIB=/path/to/cuda/lib64 pip install --no-cache-dir horovod
 
 
 FORCE-TERMINATE AT Data unpack would read past end of buffer
