Commit 6f40001

IgorWilbert authored and alsrgv committed
Docs migration to Sphinx (horovod#1088)
* rst migration: concepts
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: running horovod
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: benchmarks
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: docker
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: gpus
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: inference
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: spark
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: tensor-fusion
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: timeline
  Signed-off-by: Igor Wilbert <[email protected]>
* rst migration: troubleshooting
  Signed-off-by: Igor Wilbert <[email protected]>
* Changes requested in review 1
  Signed-off-by: Igor Wilbert <[email protected]>
* Changes requested in review 2
  Signed-off-by: Igor Wilbert <[email protected]>
1 parent 45755f6 commit 6f40001

35 files changed, with 1478 additions and 902 deletions

README.rst

Lines changed: 20 additions & 20 deletions
@@ -69,7 +69,7 @@ servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network:
    :alt: 512-GPU Benchmark

 Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
-See the `Benchmarks <docs/benchmarks.md>`_ page to find out how to reproduce these numbers.
+See the `Benchmarks <docs/benchmarks.rst>`_ page to find out how to reproduce these numbers.

 While installing MPI and NCCL itself may seem like an extra hassle, it only needs to be done once by the team dealing
 with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at
@@ -88,20 +88,20 @@ downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

 2. Install the ``horovod`` pip package.

-.. code-block:: python
+.. code-block:: bash

-    pip install horovod
+    $ pip install horovod

 This basic installation is good for laptops and for getting to know Horovod.
-If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.md>`_ page.
-If you want to use Docker, read the `Horovod in Docker <docs/docker.md>`_ page.
+If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.rst>`_ page.
+If you want to use Docker, read the `Horovod in Docker <docs/docker.rst>`_ page.


 Concepts
 --------

 Horovod core principles are based on `MPI <http://mpi-forum.org/>`_ concepts such as *size*, *rank*,
-*local rank*, *allreduce*, *allgather* and, *broadcast*. See `this page <docs/concepts.md>`_ for more details.
+*local rank*, **allreduce**, **allgather** and, *broadcast*. See `this page <docs/concepts.rst>`_ for more details.


 Usage
@@ -119,7 +119,7 @@ To use Horovod, make the following additions to your program:
    the number of workers. An increase in learning rate compensates for the increased batch size.

 4. Wrap optimizer in ``hvd.DistributedOptimizer``. The distributed optimizer delegates gradient computation
-   to the original optimizer, averages gradients using *allreduce* or *allgather*, and then applies those averaged
+   to the original optimizer, averages gradients using **allreduce** or **allgather**, and then applies those averaged
    gradients.

 5. Add ``hvd.BroadcastGlobalVariablesHook(0)`` to broadcast initial variable states from rank 0 to all other processes.
@@ -131,7 +131,7 @@ To use Horovod, make the following additions to your program:
    This can be accomplished by passing ``checkpoint_dir=None`` to ``tf.train.MonitoredTrainingSession`` if
    ``hvd.rank() != 0``.

-Example (see the `examples <examples/>`_ directory for full training examples):
+Example (see the `examples <https://github.com/horovod/horovod/blob/master/examples/>`_ directory for full training examples):

 .. code-block:: python

@@ -177,34 +177,34 @@ Example (see the `examples <examples/>`_ directory for full training examples):
 Running Horovod
 ---------------

-The example commands below show how to run distributed training. See the `Running Horovod <docs/running.md>`_
+The example commands below show how to run distributed training. See the `Running Horovod <docs/running.rst>`_
 page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

 1. To run on a machine with 4 GPUs:

 .. code-block:: bash

-    horovodrun -np 4 -H localhost:4 python train.py
+    $ horovodrun -np 4 -H localhost:4 python train.py

 2. To run on 4 machines with 4 GPUs each:

 .. code-block:: bash

-    horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
+    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

 3. To run using Open MPI without the ``horovodrun`` wrapper, see the `Running Horovod with Open MPI <docs/mpirun.rst>`_ page.

-4. To run in Docker, see the `Horovod in Docker <docs/docker.md>`_ page.
+4. To run in Docker, see the `Horovod in Docker <docs/docker.rst>`_ page.

 5. To run in Kubernetes, see `Kubeflow <https://github.com/kubeflow/kubeflow/tree/master/kubeflow/mpi-job>`_, `MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `Helm Chart <https://github.com/kubernetes/charts/tree/master/stable/horovod/>`_, and `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_.

-6. To run in Spark, see the `Spark <docs/spark.md>`_ page.
+6. To run in Spark, see the `Spark <docs/spark.rst>`_ page.

 Keras
 -----
 Horovod supports Keras and regular TensorFlow in similar ways.

-See full training `simple <examples/keras_mnist.py>`_ and `advanced <examples/keras_mnist_advanced.py>`_ examples.
+See full training `simple <https://github.com/horovod/horovod/blob/master/examples/keras_mnist.py>`_ and `advanced <https://github.com/horovod/horovod/blob/master/examples/keras_mnist_advanced.py>`_ examples.

 **Note**: Keras 2.0.9 has a `known issue <https://github.com/fchollet/keras/issues/8353>`_ that makes each worker allocate
 all GPUs on the server, instead of the GPU assigned by the *local rank*. If you have multiple GPUs per server, upgrade
@@ -221,7 +221,7 @@ MXNet
 -----
 Horovod supports MXNet and regular TensorFlow in similar ways.

-See full training `MNIST <examples/mxnet_mnist.py>`_ and `ImageNet <examples/mxnet_imagenet_resnet50.py>`_ examples. The script below provides a simple skeleton of code block based on MXNet Gluon API.
+See full training `MNIST <https://github.com/horovod/horovod/blob/master/examples/mxnet_mnist.py>`_ and `ImageNet <https://github.com/horovod/horovod/blob/master/examples/mxnet_imagenet_resnet50.py>`_ examples. The script below provides a simple skeleton of code block based on MXNet Gluon API.

 .. code-block:: python

@@ -349,15 +349,15 @@ You can check for MPI multi-threading support by querying the ``hvd.mpi_threads_

 Inference
 ---------
-Learn how to optimize your model for inference and remove Horovod operations from the graph `here <docs/inference.md>`_.
+Learn how to optimize your model for inference and remove Horovod operations from the graph `here <docs/inference.rst>`_.


 Tensor Fusion
 -------------
 One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability
-to batch small *allreduce* operations, which results in improved performance. We call this batching feature Tensor Fusion.
+to batch small **allreduce** operations, which results in improved performance. We call this batching feature Tensor Fusion.

-See `here <docs/tensor-fusion.md>`__ for full details and tweaking instructions.
+See `here <docs/tensor-fusion.rst>`__ for full details and tweaking instructions.


 Analyzing Horovod Performance
@@ -367,7 +367,7 @@ Horovod has the ability to record the timeline of its activity, called Horovod T
 .. image:: https://user-images.githubusercontent.com/16640218/29735271-9e148da0-89ac-11e7-9ae0-11d7a099ac89.png
    :alt: Horovod Timeline

-See `here <docs/timeline.md>`__ for full details and usage instructions.
+See `here <docs/timeline.rst>`__ for full details and usage instructions.


 Guides
@@ -376,7 +376,7 @@ Guides

 Troubleshooting
 ---------------
-See the `Troubleshooting <docs/troubleshooting.md>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
+See the `Troubleshooting <docs/troubleshooting.rst>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
 if you can't find an answer.

docs/benchmarks.md

Lines changed: 0 additions & 53 deletions
This file was deleted.

docs/benchmarks.rst

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,68 @@
1+
2+
.. inclusion-marker-start-do-not-remove
3+
4+
5+
Benchmarks
6+
==========
7+
8+
9+
.. image:: https://user-images.githubusercontent.com/16640218/38965607-bf5c46ca-4332-11e8-895a-b9c137e86013.png
10+
:alt: 512-GPU Benchmark
11+
12+
13+
The above benchmark was done on 128 servers with 4 Pascal GPUs each connected by a RoCE-capable 25 Gbit/s network. Horovod
14+
achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
15+
16+
To reproduce the benchmarks:
17+
18+
1. Install Horovod using the instructions provided on the `Horovod on GPU <https://github.com/horovod/horovod/blob/master/docs/gpus.rst>`__ page.
19+
20+
2. Clone `https://github.com/tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`__
21+
22+
.. code-block:: bash
23+
24+
$ git clone https://github.com/tensorflow/benchmarks
25+
$ cd benchmarks
26+
27+
28+
3. Run the benchmark. Examples below are for Open MPI.
29+
30+
.. code-block:: bash
31+
32+
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
33+
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
34+
--model resnet101 \
35+
--batch_size 64 \
36+
--variable_update horovod
37+
38+
39+
4. At the end of the run, you will see the number of images processed per second:
40+
41+
.. code-block:: bash
42+
43+
total images/sec: 1656.82
44+
45+
46+
**Real data benchmarks**
47+
48+
The benchmark instructions above are for the synthetic data benchmark.
49+
50+
To run the benchmark on a real data, you need to download the `ImageNet dataset <http://image-net.org/download-images>`__
51+
and convert it using the TFRecord `preprocessing script <https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh>`__.
52+
53+
Now, simply add ``--data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000`` to your training command:
54+
55+
.. code-block:: bash
56+
57+
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
58+
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
59+
--model resnet101 \
60+
--batch_size 64 \
61+
--variable_update horovod \
62+
--data_dir /path/to/imagenet/tfrecords \
63+
--data_name imagenet \
64+
--num_batches=2000
65+
66+
67+
68+
.. inclusion-marker-end-do-not-remove
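
The benchmark page quotes scaling efficiency and a ``total images/sec`` figure without spelling out the arithmetic. The sketch below shows one common way to relate them, assuming scaling efficiency means measured throughput divided by ideal linear throughput. That definition, the helper names, and the single-GPU baseline are illustrative assumptions, not values taken from this commit.

```python
# Sketch only: relates the "total images/sec" output to per-GPU throughput
# and to scaling efficiency, under the assumed definition
#   efficiency = measured_total / (num_gpus * single_gpu_throughput).

def per_gpu_throughput(total_images_per_sec, num_gpus):
    """Average images/sec contributed by each GPU."""
    return total_images_per_sec / num_gpus


def scaling_efficiency(total_images_per_sec, num_gpus, single_gpu_images_per_sec):
    """Fraction of ideal linear scaling that was achieved (1.0 means perfect)."""
    return total_images_per_sec / (num_gpus * single_gpu_images_per_sec)


total = 1656.82   # the sample output line from the 16-process run above
num_gpus = 16     # matches -np 16 in the example command

# Made-up baseline for illustration; measure your own single-GPU run instead.
HYPOTHETICAL_SINGLE_GPU = 110.0

print(round(per_gpu_throughput(total, num_gpus), 2))
print(scaling_efficiency(total, num_gpus, HYPOTHETICAL_SINGLE_GPU))
```

To get a real efficiency number, replace the hypothetical baseline with the throughput of a single-GPU run of the same model and batch size.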

docs/benchmarks_include.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
.. include:: ./benchmarks.rst
2+
:start-after: inclusion-marker-start-do-not-remove
3+
:end-before: inclusion-marker-end-do-not-remove
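
The ``*_include.rst`` wrappers rely on the ``include`` directive's ``start-after`` and ``end-before`` options, which select the text between two marker strings. The sketch below imitates that selection in plain Python to show which part of ``benchmarks.rst`` ends up in the Sphinx build. The function name ``slice_between`` is hypothetical, and this is not docutils' actual implementation.

```python
# Sketch only: mimics text selection between two marker strings, in the
# spirit of the include directive's start-after / end-before options.

START = "inclusion-marker-start-do-not-remove"
END = "inclusion-marker-end-do-not-remove"


def slice_between(text, start_after, end_before):
    """Return the text after the first start marker and before the next end marker."""
    start = text.find(start_after)
    if start == -1:
        raise ValueError("start marker not found")
    start += len(start_after)
    end = text.find(end_before, start)
    if end == -1:
        raise ValueError("end marker not found")
    return text[start:end]


sample = (
    "\n.. " + START + "\n\n"
    "Benchmarks\n==========\n\n"
    ".. " + END + "\n"
)
print(slice_between(sample, START, END).strip())
```

Because the markers are matched as plain strings, the ``.. `` that precedes the end marker stays in the selected text in this sketch.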

docs/concepts.md

Lines changed: 0 additions & 27 deletions
This file was deleted.

docs/concepts.rst

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
2+
.. inclusion-marker-start-do-not-remove
3+
4+
5+
Concepts
6+
========
7+
8+
Horovod core principles are based on the `MPI <http://mpi-forum.org/>`_ concepts *size*, *rank*,
9+
*local rank*, *allreduce*, *allgather*, and *broadcast*. These are best explained by example. Say we launched
10+
a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:
11+
12+
* *Size* would be the number of processes, in this case, 16.
13+
14+
* *Rank* would be the unique process ID from 0 to 15 (*size* - 1).
15+
16+
* *Local rank* would be the unique process ID within the server from 0 to 3.
17+
18+
* *Allreduce* is an operation that aggregates data among multiple processes and distributes results back to them. *Allreduce* is used to average dense tensors. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/>`__:
19+
20+
.. image:: http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/mpi_allreduce_1.png
21+
:alt: Allreduce Illustration
22+
23+
* *Allgather* is an operation that gathers data from all processes on every process. *Allgather* is used to collect values of sparse tensors. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/>`__:
24+
25+
.. image:: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/allgather.png
26+
:alt: Allgather Illustration
27+
28+
29+
* *Broadcast* is an operation that broadcasts data from one process, identified by root rank, onto every other process. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/>`__:
30+
31+
.. image:: http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/broadcast_pattern.png
32+
:alt: Broadcast Illustration
33+
34+
35+
.. inclusion-marker-end-do-not-remove
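
The concepts above can be made concrete with a small simulation of the 4-server, 4-GPU example. The sketch below is plain Python with no MPI or Horovod involved. Each list index stands in for one process, and the helper names are hypothetical. It also assumes ranks are assigned server by server, which the page does not state.

```python
# Sketch only: simulates size, rank, local rank, and the three collectives
# with lists, where index i holds the value owned by the process of rank i.

SERVERS = 4
GPUS_PER_SERVER = 4

size = SERVERS * GPUS_PER_SERVER   # total number of processes: 16
ranks = list(range(size))          # unique process IDs 0..15


def local_rank(rank):
    """Process ID within its server, assuming ranks are laid out server by server."""
    return rank % GPUS_PER_SERVER


def allreduce_average(values):
    """Every process receives the average of all contributions."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def allgather(values):
    """Every process receives the full list of contributions."""
    return [list(values) for _ in values]


def broadcast(values, root_rank=0):
    """Every process receives the value held by the root rank."""
    return [values[root_rank]] * len(values)


gradients = [float(r) for r in ranks]        # one toy "gradient" per process
print(size, ranks[-1])                       # 16 15
print([local_rank(r) for r in ranks[:6]])    # [0, 1, 2, 3, 0, 1]
print(allreduce_average(gradients)[0])       # 7.5
print(broadcast(gradients, root_rank=0)[5])  # 0.0
```

In the simulation, averaging and broadcasting leave every process with identical data, while gathering leaves every process with a copy of all the per-process values.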

docs/concepts_include.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
.. include:: ./concepts.rst
2+
:start-after: inclusion-marker-start-do-not-remove
3+
:end-before: inclusion-marker-end-do-not-remove

docs/conf.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@
4141
'sphinx.ext.autodoc',
4242
'sphinx.ext.viewcode',
4343
'sphinxcontrib.napoleon',
44+
'nbsphinx',
4445
]
4546

4647
# Add any paths that contain templates here, relative to this directory.
