Commit 62c2314

Add conda env instructions (horovod#2060)
Signed-off-by: pughdr <[email protected]>
1 parent 271c52e commit 62c2314

6 files changed

+288
-0
lines changed

README.rst

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -131,6 +131,8 @@ For more details on installing Horovod with GPU support, read `Horovod on GPU <d
131131

132132
For the full list of Horovod installation options, read the `Installation Guide <docs/install.rst>`_.
133133

134+
If you want to use Conda, read `Building a Conda environment with GPU support for Horovod <docs/conda.rst>`_.
135+
134136
If you want to use Docker, read `Horovod in Docker <docs/docker.rst>`_.
135137

136138
To compile Horovod from source, follow the instructions in the `Contributor Guide <docs/contributors.rst>`_.

docs/conda.rst

Lines changed: 274 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,274 @@
1+
.. inclusion-marker-start-do-not-remove
2+
3+
Build a Conda Environment with GPU Support for Horovod
4+
======================================================
5+
6+
In this section we describe how to build Conda environments for deep learning projects using
7+
Horovod to enable distributed training across multiple GPUs (either on the same node or
8+
spread across multiple nodes).
9+
10+
Installing the NVIDIA CUDA Toolkit
11+
----------------------------------
12+
13+
Install `NVIDIA CUDA Toolkit 10.1`_ (`documentation`_), which is the most recent version of the NVIDIA
14+
CUDA Toolkit supported by all three deep learning frameworks that are currently supported by
15+
Horovod.
16+
17+
Why not just use the ``cudatoolkit`` package?
18+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
19+
20+
Typically when installing PyTorch, TensorFlow, or Apache MXNet with GPU support using Conda, you
21+
add the appropriate version of the ``cudatoolkit`` package to your ``environment.yml`` file.
22+
Unfortunately, for the moment at least, the ``cudatoolkit`` packages available via Conda do not
23+
include the `NVIDIA CUDA Compiler (NVCC)`_, which is required in order to build Horovod extensions
24+
for PyTorch, TensorFlow, or MXNet.
25+
26+
What about the ``cudatoolkit-dev`` package?
27+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
28+
29+
While there are ``cudatoolkit-dev`` packages available from ``conda-forge`` that do include NVCC,
30+
we have had difficulty getting these packages to consistently install properly. Some of the
31+
available builds require manual intervention to accept license agreements, making these builds
32+
unsuitable for installing on remote systems (which is critical functionality). Other builds seem
33+
to work on Ubuntu but not on other flavors of Linux.
34+
35+
Despite this, we would encourage you to try adding ``cudatoolkit-dev`` to your ``environment.yml``
36+
file and see what happens! The package is well maintained so perhaps it will become more stable in
37+
the future.
38+
39+
Use the ``nvcc_linux-64`` meta-package
40+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
41+
42+
The most robust approach to obtain NVCC and still use Conda to manage all the other dependencies
43+
is to install the NVIDIA CUDA Toolkit on your system and then install a meta-package
44+
`nvcc_linux-64`_ from conda-forge, which configures your Conda environment to use the NVCC
45+
installed on the system together with the other CUDA Toolkit components installed inside the Conda
46+
environment.
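
Because this approach relies on a system-wide CUDA Toolkit, a quick pre-flight check can confirm that NVCC is visible before the environment is built. The sketch below assumes that ``CUDA_HOME`` points at the toolkit root and that NVCC lives at ``$CUDA_HOME/bin/nvcc``; the function name ``check_cuda_home`` is hypothetical and is not part of Horovod, Conda, or the ``nvcc_linux-64`` package.

```shell
# Hypothetical helper: verify that a CUDA Toolkit root contains an executable nvcc.
# Assumes the conventional layout <root>/bin/nvcc; pass a root explicitly or
# fall back to $CUDA_HOME.
check_cuda_home() {
    local root="${1:-${CUDA_HOME:-}}"
    if [ -z "$root" ]; then
        echo "CUDA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$root/bin/nvcc" ]; then
        echo "nvcc not found under $root/bin" >&2
        return 1
    fi
    echo "found nvcc at $root/bin/nvcc"
}
```

Calling ``check_cuda_home`` with no arguments uses ``$CUDA_HOME``; a non-zero exit status means the toolkit should be installed or ``CUDA_HOME`` corrected before continuing.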
47+
48+
The ``environment.yml`` file
49+
----------------------------
50+
51+
We prefer to specify as many dependencies as possible in the Conda ``environment.yml`` file
52+
and only specify dependencies in ``requirements.txt`` for install via ``pip`` that are not
53+
available via Conda channels. Check the Horovod `installation guide`_ for details of required
54+
dependencies.
55+
56+
Channel Priority
57+
^^^^^^^^^^^^^^^^
58+
59+
Use the recommended channel priorities. Note that ``conda-forge`` has priority over
60+
``defaults`` and ``pytorch`` has priority over ``conda-forge``. ::
61+
62+
name: null
63+
64+
channels:
65+
- pytorch
66+
- conda-forge
67+
- defaults
68+
69+
Dependencies
70+
^^^^^^^^^^^^
71+
72+
There are a few things worth noting about the dependencies.
73+
74+
1. Even though you have installed the NVIDIA CUDA Toolkit manually, you should still use Conda to
75+
manage the other required CUDA components such as ``cudnn`` and ``nccl`` (and the optional
76+
``cupti``).
77+
2. Use two meta-packages, ``cxx-compiler`` and ``nvcc_linux-64``, to make sure that suitable C
78+
and C++ compilers are installed and that the resulting Conda environment is aware of the
79+
manually installed CUDA Toolkit.
80+
3. Horovod requires some controller library to coordinate work between the various Horovod
81+
processes. Typically this will be some MPI implementation such as `OpenMPI`_. However, rather
82+
than specifying the ``openmpi`` package directly, you should instead opt for the `mpi4py`_ Conda
83+
package, which provides a CUDA-aware build of OpenMPI.
84+
4. Horovod also supports the `Gloo`_ collective communications library that can be used in place of
85+
MPI. Include ``cmake`` to ensure that the Horovod extensions for Gloo are built.
86+
87+
Below are the core required dependencies. The complete ``environment.yml`` file is available
88+
on GitHub. ::
89+
90+
dependencies:
91+
- bokeh=1.4
92+
- cmake=3.16 # ensures that Gloo library extensions will be built
93+
- cudnn=7.6
94+
- cupti=10.1
95+
- cxx-compiler=1.0 # ensures C and C++ compilers are available
96+
- jupyterlab=1.2
97+
- mpi4py=3.0 # installs cuda-aware openmpi
98+
- nccl=2.5
99+
- nodejs=13
100+
- nvcc_linux-64=10.1 # configures environment to be "cuda-aware"
101+
- pip=20.0
102+
- pip:
103+
- mxnet-cu101mkl==1.6.* # MXNet is installed prior to Horovod
104+
- -r file:requirements.txt
105+
- python=3.7
106+
- pytorch=1.5
107+
- tensorboard=2.1
108+
- tensorflow-gpu=2.1
109+
- torchvision=0.6
110+
111+
The ``requirements.txt`` file
112+
-----------------------------
113+
114+
The ``requirements.txt`` file is where all of the ``pip`` dependencies, including Horovod itself,
115+
are listed for installation. In addition to Horovod we typically will also use ``pip`` to install
116+
JupyterLab extensions to enable GPU and CPU resource monitoring via `jupyterlab-nvdashboard`_ and
117+
TensorBoard support via `jupyter-tensorboard`_. ::
118+
119+
horovod==0.19.*
120+
jupyterlab-nvdashboard==0.2.*
121+
jupyter-tensorboard==0.2.*
122+
123+
# make sure horovod is re-compiled if environment is re-built
124+
--no-binary=horovod
125+
126+
Note the use of the ``--no-binary`` option at the end of the file. Including this option ensures
127+
that Horovod will be re-built whenever the Conda environment is re-built.
128+
129+
Building the Conda environment
130+
------------------------------
131+
132+
After adding any necessary dependencies that should be downloaded via Conda to the
133+
``environment.yml`` file and any dependencies that should be downloaded via ``pip`` to the
134+
``requirements.txt`` file, create the Conda environment in a sub-directory ``env`` of your
135+
project directory by running the following commands.
136+
137+
.. code-block:: bash
138+
139+
$ export ENV_PREFIX=$PWD/env
140+
$ export HOROVOD_CUDA_HOME=$CUDA_HOME
141+
$ export HOROVOD_NCCL_HOME=$ENV_PREFIX
142+
$ export HOROVOD_GPU_OPERATIONS=NCCL
143+
$ conda env create --prefix $ENV_PREFIX --file environment.yml --force
144+
145+
By default, Horovod will try to build extensions for all detected frameworks. See the
146+
documentation on `environment variables`_ for the details on additional environment variables that
147+
can be set prior to building Horovod.
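
For instance, the ``HOROVOD_WITH_...`` and ``HOROVOD_WITHOUT_...`` variables described in the installation guide can be exported before the environment is created to control which framework extensions are built. The following is a minimal sketch, assuming a project that needs only the PyTorch extension; the choice of frameworks is an illustration, not a recommendation.

```shell
# Sketch: require the PyTorch extension and skip the others.
# The variable names come from the Horovod installation guide; which
# frameworks to enable or disable here is only an assumption for illustration.
export HOROVOD_WITH_PYTORCH=1        # fail the build if the PyTorch extension cannot be built
export HOROVOD_WITHOUT_TENSORFLOW=1  # do not build the TensorFlow extension
export HOROVOD_WITHOUT_MXNET=1       # do not build the MXNet extension
```

These exports would sit alongside the ``HOROVOD_CUDA_HOME``, ``HOROVOD_NCCL_HOME``, and ``HOROVOD_GPU_OPERATIONS`` settings, before the ``conda env create`` command.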
148+
149+
Once the new environment has been created you can activate the environment with the following
150+
command.
151+
152+
.. code-block:: bash
153+
154+
$ conda activate $ENV_PREFIX
155+
156+
The ``postBuild`` file
157+
^^^^^^^^^^^^^^^^^^^^^^
158+
159+
If you wish to use any JupyterLab extensions included in the ``environment.yml`` and
160+
``requirements.txt`` files, then you may need to rebuild the JupyterLab application.
161+
162+
For simplicity, we typically include the instructions for re-building JupyterLab in a
163+
``postBuild`` script. Here is what this script looks like in the example Horovod environments.
164+
165+
.. code-block:: bash
166+
167+
jupyter labextension install --no-build jupyterlab-nvdashboard
168+
jupyter labextension install --no-build jupyterlab_tensorboard
169+
jupyter lab build
170+
171+
Use the following commands to source the ``postBuild`` script.
172+
173+
.. code-block:: bash
174+
175+
$ conda activate $ENV_PREFIX # optional if environment already active
176+
$ . postBuild
177+
178+
Listing the contents of the Conda environment
179+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
180+
To see the full list of packages installed into the environment, run the following command.
181+
182+
.. code-block:: bash
183+
184+
$ conda activate $ENV_PREFIX # optional if environment already active
185+
$ conda list
186+
187+
Verifying the Conda environment
188+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
189+
190+
After building the Conda environment, check that Horovod has been built with support for the deep
191+
learning frameworks TensorFlow, PyTorch, Apache MXNet, and the controllers MPI and Gloo with the
192+
following command.
193+
194+
.. code-block:: bash
195+
196+
$ conda activate $ENV_PREFIX # optional if environment already active
197+
$ horovodrun --check-build
198+
199+
You should see output similar to the following. ::
200+
201+
Horovod v0.19.4:
202+
Available Frameworks:
203+
[X] TensorFlow
204+
[X] PyTorch
205+
[X] MXNet
206+
Available Controllers:
207+
[X] MPI
208+
[X] Gloo
209+
Available Tensor Operations:
210+
[X] NCCL
211+
[ ] DDL
212+
[ ] CCL
213+
[X] MPI
214+
[X] Gloo
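
In a script, this check can be automated by scanning the output for unchecked boxes. The sketch below assumes the output format shown above (``[X]`` for available, ``[ ]`` for unavailable); the function name ``list_unavailable`` is hypothetical and not part of Horovod.

```shell
# Hypothetical helper: read `horovodrun --check-build` output on stdin and
# print the name of every component marked unavailable ("[ ]").
list_unavailable() {
    # Strip the leading whitespace and the empty checkbox, print only matching lines.
    sed -n 's/^[[:space:]]*\[ \][[:space:]]*//p'
}

# Usage (requires an activated environment with Horovod installed):
#   horovodrun --check-build | list_unavailable
```

Applied to the sample output above, the helper would print ``DDL`` and ``CCL``, the two operations that this environment does not build.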
215+
216+
Wrapping it all up in a Bash script
217+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
218+
219+
We typically wrap these commands into a shell script ``create-conda-env.sh``. Running the shell
220+
script will set the Horovod build variables, create the Conda environment, activate the Conda
221+
environment, and build JupyterLab with any additional extensions.
222+
223+
.. code-block:: bash
224+
225+
#!/bin/bash --login
226+
227+
set -e
228+
229+
export ENV_PREFIX=$PWD/env
230+
export HOROVOD_CUDA_HOME=$CUDA_HOME
231+
export HOROVOD_NCCL_HOME=$ENV_PREFIX
232+
export HOROVOD_GPU_OPERATIONS=NCCL
233+
conda env create --prefix $ENV_PREFIX --file environment.yml --force
234+
conda activate $ENV_PREFIX
235+
. postBuild
236+
237+
We recommend that you put scripts inside a ``bin`` directory in your project root directory. Run
238+
the script from the project root directory as follows.
239+
240+
.. code-block:: bash
241+
242+
./bin/create-conda-env.sh # assumes that $CUDA_HOME is set properly
243+
244+
Updating the Conda environment
245+
------------------------------
246+
247+
If you add dependencies to (or remove them from) the ``environment.yml`` file or the
248+
``requirements.txt`` file after the environment has already been created, then you can
249+
re-create the environment with the following command.
250+
251+
.. code-block:: bash
252+
253+
$ conda env create --prefix $ENV_PREFIX --file environment.yml --force
254+
255+
However, whenever we add or remove dependencies, we prefer to re-run the Bash script, which will re-build
256+
both the Conda environment and JupyterLab.
257+
258+
.. code-block:: bash
259+
260+
$ ./bin/create-conda-env.sh
261+
262+
.. _NVIDIA CUDA Toolkit 10.1: https://developer.nvidia.com/cuda-10.1-download-archive-update2
263+
.. _documentation: https://docs.nvidia.com/cuda/archive/10.1/
264+
.. _NVIDIA CUDA Compiler (NVCC): https://docs.nvidia.com/cuda/archive/10.1/cuda-compiler-driver-nvcc/index.html
265+
.. _nvcc_linux-64: https://github.com/conda-forge/nvcc-feedstock
266+
.. _installation guide: https://horovod.readthedocs.io/en/latest/install_include.html
267+
.. _OpenMPI: https://www.open-mpi.org/
268+
.. _mpi4py: https://mpi4py.readthedocs.io/en/stable/
269+
.. _Gloo: https://github.com/facebookincubator/gloo
270+
.. _jupyterlab-nvdashboard: https://github.com/rapidsai/jupyterlab-nvdashboard
271+
.. _jupyter-tensorboard: https://github.com/lspvic/jupyter_tensorboard
272+
.. _environment variables: https://horovod.readthedocs.io/en/latest/install_include.html#environment-variables
273+
274+
.. inclusion-marker-end-do-not-remove

docs/conda_include.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
.. include:: ./conda.rst
2+
:start-after: inclusion-marker-start-do-not-remove
3+
:end-before: inclusion-marker-end-do-not-remove

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -117,6 +117,8 @@ Guides
117117

118118
gpus_include
119119

120+
conda_include
121+
120122
docker_include
121123

122124
spark_include

docs/install.rst

Lines changed: 5 additions & 0 deletions
@@ -200,6 +200,11 @@ diagnose failures:
200200
$ pip uninstall horovod
201201
$ HOROVOD_WITH_...=1 pip install --no-cache-dir horovod
202202
203+
Installing Horovod with Conda (+pip)
204+
------------------------------------
205+
206+
To use Conda to install PyTorch, TensorFlow, MXNet, and Horovod, as well as GPU dependencies such as
207+
NVIDIA CUDA Toolkit, cuDNN, NCCL, etc., see `Build a Conda Environment with GPU Support for Horovod <conda.rst>`_.
203208

204209
Environment Variables
205210
---------------------

docs/summary.rst

Lines changed: 2 additions & 0 deletions
@@ -123,6 +123,8 @@ For more details on installing Horovod with GPU support, read `Horovod on GPU <g
123123

124124
For the full list of Horovod installation options, read the `Installation Guide <install.rst>`_.
125125

126+
If you want to use Conda, read `Building a Conda environment with GPU support for Horovod <conda.rst>`_.
127+
126128
If you want to use Docker, read `Horovod in Docker <docker.rst>`_.
127129

128130
To compile Horovod from source, follow the instructions in the `Contributor Guide <contributors.rst>`_.
