Commit 1359e3a

Add some documentation for Alltoall and Process Sets (horovod#3096)
Signed-off-by: Max H. Gerlach <[email protected]>
1 parent 87094a4 commit 1359e3a

File tree: 6 files changed (+129, -3 lines)


README.rst

Lines changed: 10 additions & 1 deletion
@@ -163,7 +163,8 @@ To compile Horovod from source, follow the instructions in the `Contributor Guid
 Concepts
 --------
 Horovod core principles are based on `MPI <http://mpi-forum.org/>`_ concepts such as *size*, *rank*,
-*local rank*, **allreduce**, **allgather** and, *broadcast*. See `this page <docs/concepts.rst>`_ for more details.
+*local rank*, **allreduce**, **allgather**, **broadcast**, and **alltoall**. See `this page <docs/concepts.rst>`_
+for more details.
 
 Supported frameworks
 --------------------
@@ -389,6 +390,14 @@ a good amount of trial and error. We provide a system to automate this performan
 See `here <docs/autotune.rst>`__ for full details and usage instructions.
 
 
+Horovod Process Sets
+--------------------
+Horovod allows you to concurrently run distinct collective operations in different groups of processes taking part in
+one distributed training. Set up ``hvd.process_set`` objects to make use of this capability.
+
+See `Process Sets <docs/process_set.rst>`__ for detailed instructions.
+
+
 Guides
 ------
 1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.
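The new *Horovod Process Sets* paragraph above only names the ``hvd.process_set`` objects involved. Condensed from the fuller examples introduced in ``docs/process_set.rst`` further down in this commit, a minimal sketch assuming a four-process TensorFlow job (the tensor is a placeholder):

.. code-block:: python

    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Two disjoint process sets inside one four-process training job.
    even_set = hvd.ProcessSet([0, 2])
    odd_set = hvd.ProcessSet([1, 3])
    hvd.init(process_sets=[even_set, odd_set])

    # Each subset can now run its own collectives concurrently,
    # e.g. an allreduce limited to ranks 0 and 2:
    if hvd.rank() in (0, 2):
        result = hvd.allreduce(tf.ones([2, 2]), process_set=even_set)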

docs/concepts.rst

Lines changed: 4 additions & 1 deletion
@@ -6,7 +6,7 @@ Concepts
 ========
 
 Horovod core principles are based on the `MPI <http://mpi-forum.org/>`_ concepts *size*, *rank*,
-*local rank*, *allreduce*, *allgather*, and *broadcast*. These are best explained by example. Say we launched
+*local rank*, *allreduce*, *allgather*, *broadcast*, and *alltoall*. These are best explained by example. Say we launched
 a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:
 
 * *Size* would be the number of processes, in this case, 16.
@@ -32,4 +32,7 @@ a training script on 4 servers, each having 4 GPUs. If we launched one copy of t
 :alt: Broadcast Illustration
 
 
+* *Alltoall* is an operation to exchange data between all processes. *Alltoall* may be useful to implement neural networks with advanced architectures that span multiple devices.
+
+
 .. inclusion-marker-end-do-not-remove
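The *Alltoall* bullet added above is easiest to see with a toy snippet. A minimal sketch, assuming ``horovod.tensorflow`` and a four-process job; the tensor values and the ``splits`` argument are chosen purely for illustration:

.. code-block:: python

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # With 4 processes, rank r contributes [10*r + 0, ..., 10*r + 3];
    # element i is destined for rank i.
    tensor = tf.constant([10 * hvd.rank() + i for i in range(hvd.size())])

    # splits: how many elements of `tensor` go to each destination rank
    # (here, one element per rank).
    splits = tf.constant([1] * hvd.size())

    # Every rank receives one element from every peer, concatenated in
    # rank order, e.g. rank 0 ends up with [0, 10, 20, 30].
    result = hvd.alltoall(tensor, splits=splits)

Run under ``horovodrun -np 4``, each rank ends up holding exactly one element from every other rank.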

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -141,6 +141,8 @@ Guides
 
    autotune_include
 
+   process_set_include
+
    troubleshooting_include
 
    contributors_include

docs/process_set.rst

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+.. inclusion-marker-start-do-not-remove
+
+
+Process Sets: Concurrently Running Collective Operations
+========================================================
+
+Most Horovod operations in TensorFlow, PyTorch, or MXNet feature a ``process_set`` argument: By setting up different
+process sets you may have multiple subsets of the world of Horovod processes run distinct collective operations in
+parallel. Besides Horovod's fundamental operations like ``hvd.allgather``, ``hvd.allreduce``, ``hvd.alltoall``,
+``hvd.broadcast``, or ``hvd.grouped_allreduce``, also many high-level utility objects such as
+``hvd.DistributedOptimizer`` come with support for process sets.
+
+As an example consider building a Horovod model to be trained by four worker processes with two concurrent allreduce
+operations on the "even" or "odd" subset. In the following we will see three ways to configure Horovod to use an even
+and an odd process set, offering you as much flexibility as you need. The code examples are presented for TensorFlow,
+but the interface for the other supported frameworks is equivalent.
+
+1) Static process sets
+----------------------
+
+.. code-block:: python
+
+    # on all ranks
+    even_set = hvd.ProcessSet([0,2])
+    odd_set = hvd.ProcessSet([1,3])
+    hvd.init(process_sets=[even_set, odd_set])
+
+    for p in [hvd.global_process_set, even_set, odd_set]:
+        print(p)
+    # ProcessSet(process_set_id=0, ranks=[0, 1, 2, 3], mpi_comm=None)
+    # ProcessSet(process_set_id=1, ranks=[0, 2], mpi_comm=None)
+    # ProcessSet(process_set_id=2, ranks=[1, 3], mpi_comm=None)
+
+    # on ranks 0 and 2
+    result = hvd.allreduce(tensor_for_even_ranks, process_set=even_set)
+
+    # on ranks 1 and 3
+    result = hvd.allreduce(tensor_for_odd_ranks, process_set=odd_set)
+
+Having initialized Horovod like this, the configuration of process sets cannot be changed without restarting the
+program. If you only use the default global process set (``hvd.global_process_set``), there is no impact on
+performance.
+
+2) Static process sets from MPI communicators
+---------------------------------------------
+
+.. code-block:: python
+
+    # on all ranks
+    from mpi4py import MPI
+    comm = MPI.COMM_WORLD
+    subcomm = MPI.COMM_WORLD.Split(color=MPI.COMM_WORLD.rank % 2,
+                                   key=MPI.COMM_WORLD.rank)
+
+    split_process_set = hvd.ProcessSet(subcomm)
+
+    hvd.init(comm, process_sets=[split_process_set])
+
+    for p in [hvd.global_process_set, split_process_set]:
+        print(p)
+    # ProcessSet(process_set_id=0, ranks=[0, 1, 2, 3], mpi_comm=<mpi4py.MPI.Intracomm object at 0x7fb817323dd0>)
+    # ProcessSet(process_set_id=1, ranks=[0, 2], mpi_comm=<mpi4py.MPI.Intracomm object at 0x7fb87e2ddfb0>)
+    ## (split_process_set differs by rank)
+
+    # on ranks 0 and 2
+    result = hvd.allreduce(tensor_for_even_ranks, process_set=split_process_set)
+
+    # on ranks 1 and 3
+    result = hvd.allreduce(tensor_for_odd_ranks, process_set=split_process_set)
+
+If you are already using multiple MPI communicators in your distributed program, you can plug them right in.
+
+3) Dynamic process sets
+-----------------------
+
+.. code-block:: python
+
+    # on all ranks
+    hvd.init(process_sets="dynamic")  # alternatively set HOROVOD_DYNAMIC_PROCESS_SETS=1
+    even_set = hvd.add_process_set([0,2])
+    odd_set = hvd.add_process_set([1,3])
+
+    for p in [hvd.global_process_set, even_set, odd_set]:
+        print(p)
+    # ProcessSet(process_set_id=0, ranks=[0, 1, 2, 3], mpi_comm=None)
+    # ProcessSet(process_set_id=1, ranks=[0, 2], mpi_comm=None)
+    # ProcessSet(process_set_id=2, ranks=[1, 3], mpi_comm=None)
+
+    # on ranks 0 and 2
+    result = hvd.allreduce(tensor_for_even_ranks, process_set=even_set)
+
+    # on ranks 1 and 3
+    result = hvd.allreduce(tensor_for_odd_ranks, process_set=odd_set)
+
+The most flexible setup is achieved with "dynamic" process sets. Process sets can be registered and deregistered
+dynamically at any time after initializing Horovod via ``hvd.add_process_set()`` and ``hvd.remove_process_set()``.
+Calls to these functions must be made identically and in the same order by all processes.
+
+Note that dynamic process sets come with some slight extra synchronization overhead.
+
+.. inclusion-marker-end-do-not-remove
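Since the dynamic example above only names ``hvd.remove_process_set()`` without showing it, here is a minimal sketch of reshaping process sets at runtime, assuming the ``"dynamic"`` initialization from example 3 and a four-process job; the particular rank groupings are arbitrary:

.. code-block:: python

    import horovod.tensorflow as hvd

    # on all ranks
    hvd.init(process_sets="dynamic")

    even_set = hvd.add_process_set([0, 2])
    # ... collectives restricted to ranks 0 and 2 can use even_set here ...

    # Deregister the set and register a different split.  All processes
    # must make these calls identically and in the same order.
    hvd.remove_process_set(even_set)
    front_set = hvd.add_process_set([0, 1])
    # ... collectives restricted to ranks 0 and 1 can now use front_set ...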

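The introduction of ``process_set.rst`` also notes that high-level wrappers such as ``hvd.DistributedOptimizer`` accept a process set. A hedged TensorFlow sketch, assuming the static even/odd sets from example 1 and that your Horovod build exposes the ``process_set`` keyword on the optimizer wrapper (it ships together with process-set support); the optimizer is a placeholder:

.. code-block:: python

    import tensorflow as tf
    import horovod.tensorflow as hvd

    even_set = hvd.ProcessSet([0, 2])
    odd_set = hvd.ProcessSet([1, 3])
    hvd.init(process_sets=[even_set, odd_set])

    # Gradient averaging is then restricted to the ranks in even_set
    # instead of spanning all Horovod processes.
    opt = tf.compat.v1.train.GradientDescentOptimizer(0.01)
    opt = hvd.DistributedOptimizer(opt, process_set=even_set)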
docs/process_set_include.rst

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+.. include:: ./process_set.rst
+   :start-after: inclusion-marker-start-do-not-remove
+   :end-before: inclusion-marker-end-do-not-remove

docs/summary.rst

Lines changed: 10 additions & 1 deletion
@@ -155,7 +155,8 @@ To compile Horovod from source, follow the instructions in the `Contributor Guid
 Concepts
 --------
 Horovod core principles are based on `MPI <http://mpi-forum.org/>`_ concepts such as *size*, *rank*,
-*local rank*, **allreduce**, **allgather** and, *broadcast*. See `this page <concepts.rst>`_ for more details.
+*local rank*, **allreduce**, **allgather**, **broadcast**, and **alltoall**. See `this page <concepts.rst>`_
+for more details.
 
 Supported frameworks
 --------------------
@@ -381,6 +382,14 @@ a good amount of trial and error. We provide a system to automate this performan
 See `here <autotune.rst>`__ for full details and usage instructions.
 
 
+Horovod Process Sets
+--------------------
+Horovod allows you to concurrently run distinct collective operations in different groups of processes taking part in
+one distributed training. Set up ``hvd.process_set`` objects to make use of this capability.
+
+See `Process Sets <process_set.rst>`__ for detailed instructions.
+
+
 Guides
 ------
 1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.
