WeichenXu123
diff --git a/‎README.rst
Lines changed: 20 additions & 20 deletions b/‎README.rst
Lines changed: 20 additions & 20 deletions
diff --git a/‎docs/benchmarks.md
Lines changed: 0 additions & 53 deletions b/‎docs/benchmarks.md
Lines changed: 0 additions & 53 deletions
diff --git a/‎docs/benchmarks.rst
Lines changed: 68 additions & 0 deletions b/‎docs/benchmarks.rst
Lines changed: 68 additions & 0 deletions
diff --git a/‎docs/benchmarks_include.rst
Lines changed: 3 additions & 0 deletions b/‎docs/benchmarks_include.rst
Lines changed: 3 additions & 0 deletions
diff --git a/‎docs/concepts.md
Lines changed: 0 additions & 27 deletions b/‎docs/concepts.md
Lines changed: 0 additions & 27 deletions
diff --git a/‎docs/concepts.rst
Lines changed: 35 additions & 0 deletions b/‎docs/concepts.rst
Lines changed: 35 additions & 0 deletions
diff --git a/‎docs/concepts_include.rst
Lines changed: 3 additions & 0 deletions b/‎docs/concepts_include.rst
Lines changed: 3 additions & 0 deletions
diff --git a/‎docs/conf.py
Lines changed: 1 addition & 0 deletions b/‎docs/conf.py
Lines changed: 1 addition & 0 deletions
@@ -69,7 +69,7 @@ servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network:
    :alt: 512-GPU Benchmark
 
 Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
-See the `Benchmarks <docs/benchmarks.md>`_ page to find out how to reproduce these numbers.
+See the `Benchmarks <docs/benchmarks.rst>`_ page to find out how to reproduce these numbers.
 
 While installing MPI and NCCL itself may seem like an extra hassle, it only needs to be done once by the team dealing
 with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at
@@ -88,20 +88,20 @@ downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
 
 2. Install the ``horovod`` pip package.
 
-.. code-block:: python
+.. code-block:: bash
 
-    pip install horovod
+    $ pip install horovod
 
 This basic installation is good for laptops and for getting to know Horovod.
-If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.md>`_ page.
-If you want to use Docker, read the `Horovod in Docker <docs/docker.md>`_ page.
+If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.rst>`_ page.
+If you want to use Docker, read the `Horovod in Docker <docs/docker.rst>`_ page.
 
 
 Concepts
 --------
 
 Horovod core principles are based on `MPI <http://mpi-forum.org/>`_ concepts such as *size*, *rank*,
-*local rank*, *allreduce*, *allgather* and, *broadcast*. See `this page <docs/concepts.md>`_ for more details.
+*local rank*, **allreduce**, **allgather** and, *broadcast*. See `this page <docs/concepts.rst>`_ for more details.
 
 
 Usage
@@ -119,7 +119,7 @@ To use Horovod, make the following additions to your program:
    the number of workers. An increase in learning rate compensates for the increased batch size.
 
 4. Wrap optimizer in ``hvd.DistributedOptimizer``.  The distributed optimizer delegates gradient computation
-   to the original optimizer, averages gradients using *allreduce* or *allgather*, and then applies those averaged
+   to the original optimizer, averages gradients using **allreduce** or **allgather**, and then applies those averaged
    gradients.
 
 5. Add ``hvd.BroadcastGlobalVariablesHook(0)`` to broadcast initial variable states from rank 0 to all other processes.
@@ -131,7 +131,7 @@ To use Horovod, make the following additions to your program:
    This can be accomplished by passing ``checkpoint_dir=None`` to ``tf.train.MonitoredTrainingSession`` if
    ``hvd.rank() != 0``.
 
-Example (see the `examples <examples/>`_ directory for full training examples):
+Example (see the `examples <https://github.com/horovod/horovod/blob/master/examples/>`_ directory for full training examples):
 
 .. code-block:: python
 
@@ -177,34 +177,34 @@ Example (see the `examples <examples/>`_ directory for full training examples):
 Running Horovod
 ---------------
 
-The example commands below show how to run distributed training. See the `Running Horovod <docs/running.md>`_
+The example commands below show how to run distributed training. See the `Running Horovod <docs/running.rst>`_
 page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.
 
 1. To run on a machine with 4 GPUs:
 
 .. code-block:: bash
 
-    horovodrun -np 4 -H localhost:4 python train.py
+     $ horovodrun -np 4 -H localhost:4 python train.py
 
 2. To run on 4 machines with 4 GPUs each:
 
 .. code-block:: bash
 
-    horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
+    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
 
 3. To run using Open MPI without the ``horovodrun`` wrapper, see the `Running Horovod with Open MPI <docs/mpirun.rst>`_ page.
 
-4. To run in Docker, see the `Horovod in Docker <docs/docker.md>`_ page.
+4. To run in Docker, see the `Horovod in Docker <docs/docker.rst>`_ page.
 
 5. To run in Kubernetes, see `Kubeflow <https://github.com/kubeflow/kubeflow/tree/master/kubeflow/mpi-job>`_, `MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `Helm Chart <https://github.com/kubernetes/charts/tree/master/stable/horovod/>`_, and `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_.
 
-6. To run in Spark, see the `Spark <docs/spark.md>`_ page.
+6. To run in Spark, see the `Spark <docs/spark.rst>`_ page.
 
 Keras
 -----
 Horovod supports Keras and regular TensorFlow in similar ways.
 
-See full training `simple <examples/keras_mnist.py>`_ and `advanced <examples/keras_mnist_advanced.py>`_ examples.
+See full training `simple <https://github.com/horovod/horovod/blob/master/examples/keras_mnist.py>`_ and `advanced <https://github.com/horovod/horovod/blob/master/examples/keras_mnist_advanced.py>`_ examples.
 
 **Note**: Keras 2.0.9 has a `known issue <https://github.com/fchollet/keras/issues/8353>`_ that makes each worker allocate
 all GPUs on the server, instead of the GPU assigned by the *local rank*. If you have multiple GPUs per server, upgrade
@@ -221,7 +221,7 @@ MXNet
 -----
 Horovod supports MXNet and regular TensorFlow in similar ways.
 
-See full training `MNIST <examples/mxnet_mnist.py>`_ and `ImageNet <examples/mxnet_imagenet_resnet50.py>`_ examples. The script below provides a simple skeleton of code block based on MXNet Gluon API.
+See full training `MNIST <https://github.com/horovod/horovod/blob/master/examples/mxnet_mnist.py>`_ and `ImageNet <https://github.com/horovod/horovod/blob/master/examples/mxnet_imagenet_resnet50.py>`_ examples. The script below provides a simple skeleton of code block based on MXNet Gluon API.
 
 .. code-block:: python
 
@@ -349,15 +349,15 @@ You can check for MPI multi-threading support by querying the ``hvd.mpi_threads_
 
 Inference
 ---------
-Learn how to optimize your model for inference and remove Horovod operations from the graph `here <docs/inference.md>`_.
+Learn how to optimize your model for inference and remove Horovod operations from the graph `here <docs/inference.rst>`_.
 
 
 Tensor Fusion
 -------------
 One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability
-to batch small *allreduce* operations, which results in improved performance. We call this batching feature Tensor Fusion.
+to batch small **allreduce** operations, which results in improved performance. We call this batching feature Tensor Fusion.
 
-See `here <docs/tensor-fusion.md>`__ for full details and tweaking instructions.
+See `here <docs/tensor-fusion.rst>`__ for full details and tweaking instructions.
 
 
 Analyzing Horovod Performance
@@ -367,7 +367,7 @@ Horovod has the ability to record the timeline of its activity, called Horovod T
 .. image:: https://user-images.githubusercontent.com/16640218/29735271-9e148da0-89ac-11e7-9ae0-11d7a099ac89.png
    :alt: Horovod Timeline
 
-See `here <docs/timeline.md>`__ for full details and usage instructions.
+See `here <docs/timeline.rst>`__ for full details and usage instructions.
 
 
 Guides
@@ -376,7 +376,7 @@ Guides
 
 Troubleshooting
 ---------------
-See the `Troubleshooting <docs/troubleshooting.md>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
+See the `Troubleshooting <docs/troubleshooting.rst>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
 if you can't find an answer.
 
 
 
@@ -0,0 +1,68 @@
+
+.. inclusion-marker-start-do-not-remove
+
+
+Benchmarks
+==========
+
+
+.. image:: https://user-images.githubusercontent.com/16640218/38965607-bf5c46ca-4332-11e8-895a-b9c137e86013.png
+   :alt: 512-GPU Benchmark
+
+
+The above benchmark was done on 128 servers with 4 Pascal GPUs each connected by a RoCE-capable 25 Gbit/s network. Horovod
+achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
+
+To reproduce the benchmarks:
+
+1. Install Horovod using the instructions provided on the `Horovod on GPU <https://github.com/horovod/horovod/blob/master/docs/gpus.rst>`__ page.
+
+2. Clone `https://github.com/tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`__
+
+.. code-block:: bash
+
+    $ git clone https://github.com/tensorflow/benchmarks
+    $ cd benchmarks
+
+
+3. Run the benchmark. Examples below are for Open MPI.
+
+.. code-block:: bash
+
+    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
+        python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+            --model resnet101 \
+            --batch_size 64 \
+            --variable_update horovod
+
+
+4. At the end of the run, you will see the number of images processed per second:
+
+.. code-block:: bash
+
+    total images/sec: 1656.82
+
+
+**Real data benchmarks**
+
+The benchmark instructions above are for the synthetic data benchmark.
+
+To run the benchmark on a real data, you need to download the `ImageNet dataset <http://image-net.org/download-images>`__
+and convert it using the TFRecord `preprocessing script <https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh>`__.
+
+Now, simply add ``--data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000`` to your training command:
+
+.. code-block:: bash
+
+    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
+        python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+            --model resnet101 \
+            --batch_size 64 \
+            --variable_update horovod \
+            --data_dir /path/to/imagenet/tfrecords \
+            --data_name imagenet \
+            --num_batches=2000
+
+
+
+.. inclusion-marker-end-do-not-remove
@@ -0,0 +1,3 @@
+.. include:: ./benchmarks.rst
+   :start-after: inclusion-marker-start-do-not-remove
+   :end-before: inclusion-marker-end-do-not-remove
@@ -0,0 +1,35 @@
+
+.. inclusion-marker-start-do-not-remove
+
+
+Concepts
+========
+
+Horovod core principles are based on the `MPI <http://mpi-forum.org/>`_ concepts *size*, *rank*,
+*local rank*, *allreduce*, *allgather*, and *broadcast*. These are best explained by example. Say we launched
+a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:
+
+* *Size* would be the number of processes, in this case, 16.
+
+* *Rank* would be the unique process ID from 0 to 15 (*size* - 1).
+
+* *Local rank* would be the unique process ID within the server from 0 to 3.
+
+* *Allreduce* is an operation that aggregates data among multiple processes and distributes results back to them.  *Allreduce* is used to average dense tensors.  Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/>`__:
+
+.. image:: http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/mpi_allreduce_1.png
+   :alt: Allreduce Illustration
+
+* *Allgather* is an operation that gathers data from all processes on every process.  *Allgather* is used to collect values of sparse tensors.  Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/>`__:
+
+.. image:: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/allgather.png
+   :alt: Allgather Illustration
+
+
+* *Broadcast* is an operation that broadcasts data from one process, identified by root rank, onto every other process. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/>`__:
+
+    .. image:: http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/broadcast_pattern.png
+       :alt: Broadcast Illustration
+
+
+.. inclusion-marker-end-do-not-remove
@@ -0,0 +1,3 @@
+.. include:: ./concepts.rst
+   :start-after: inclusion-marker-start-do-not-remove
+   :end-before: inclusion-marker-end-do-not-remove
@@ -41,6 +41,7 @@
     'sphinx.ext.autodoc',
     'sphinx.ext.viewcode',
     'sphinxcontrib.napoleon',
+    'nbsphinx',
 ]
 
 # Add any paths that contain templates here, relative to this directory.
Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,3 @@`
	`1`	`+.. include:: ./benchmarks.rst`
	`2`	`+ :start-after: inclusion-marker-start-do-not-remove`
	`3`	`+ :end-before: inclusion-marker-end-do-not-remove`
Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,3 @@`
	`1`	`+.. include:: ./concepts.rst`
	`2`	`+ :start-after: inclusion-marker-start-do-not-remove`
	`3`	`+ :end-before: inclusion-marker-end-do-not-remove`
Original file line number	Diff line number	Diff line change
`@@ -41,6 +41,7 @@`
`41`	`41`	`'sphinx.ext.autodoc',`
`42`	`42`	`'sphinx.ext.viewcode',`
`43`	`43`	`'sphinxcontrib.napoleon',`
	`44`	`+ 'nbsphinx',`
`44`	`45`	`]`
`45`	`46`
`46`	`47`	`# Add any paths that contain templates here, relative to this directory.`