
Commit ffd9310

Added HOROVOD_GPU_OPERATIONS installation variable (horovod#1960)
Signed-off-by: Travis Addair <[email protected]>
1 parent 2dc0553 commit ffd9310

20 files changed (+108, -84 lines)
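
In effect, this commit consolidates the three per-operation installation variables (`HOROVOD_GPU_ALLREDUCE`, `HOROVOD_GPU_ALLGATHER`, `HOROVOD_GPU_BROADCAST`) into a single `HOROVOD_GPU_OPERATIONS` variable. A minimal before/after sketch of the install command, drawn from the diffs below:

    # before this commit: one variable per GPU operation
    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install horovod

    # after this commit: one variable covering allreduce, allgather, and broadcast
    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod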

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -10,6 +10,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 - Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. ([#1849](https://github.com/horovod/horovod/pull/1849))
 
+- Added NCCL implementation of the allgather operation. ([#1952](https://github.com/horovod/horovod/pull/1952))
+
+- Added `HOROVOD_GPU_OPERATIONS` installation variable to simplify enabling NCCL support for all GPU operations. ([#1960](https://github.com/horovod/horovod/pull/1960))
+
 ### Changed
 
 ### Deprecated

Dockerfile.gpu

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ RUN mkdir /tmp/openmpi && \
 
 # Install Horovod, temporarily using CUDA stubs
 RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
-    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
+    HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
     pip install --no-cache-dir horovod && \
     ldconfig
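
For a quick local check of the updated image, a standard build command works; the tag here is illustrative:

    $ docker build -t horovod:gpu -f Dockerfile.gpu .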

Dockerfile.test.gpu

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ ARG PYTORCH_PACKAGE=torch==1.2.0
 ARG TORCHVISION_PACKAGE=torchvision==0.4.0
 ARG MXNET_PACKAGE=mxnet-cu100==1.5.0
 ARG PYSPARK_PACKAGE=pyspark==2.4.0
-ARG HOROVOD_BUILD_FLAGS="HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL"
+ARG HOROVOD_BUILD_FLAGS="HOROVOD_GPU_OPERATIONS=NCCL"
 ARG HOROVOD_MIXED_INSTALL=0
 
 # Set default shell to /bin/bash
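
Because `HOROVOD_BUILD_FLAGS` is a Dockerfile `ARG`, the new default can still be overridden per build without editing the file. A sketch (the image tag and the MPI value are illustrative; MPI is listed as an accepted value in the docs/install.rst diff below):

    $ docker build -f Dockerfile.test.gpu \
        --build-arg HOROVOD_BUILD_FLAGS="HOROVOD_GPU_OPERATIONS=MPI" \
        -t horovod-test:gpu .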

README.rst

Lines changed: 7 additions & 2 deletions
@@ -52,8 +52,13 @@ about who's involved and how Horovod plays a role, read the LF AI `announcement
 
 .. contents::
 
+|
+
+Documentation
+-------------
 
-The full documentation and an API reference are published at https://horovod.readthedocs.io/en/latest.
+- `Latest Release <https://horovod.readthedocs.io/en/stable>`_
+- `master <https://horovod.readthedocs.io/en/latest>`_
 
 |
 
@@ -118,7 +123,7 @@ To install Horovod:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
 
 This basic installation is good for laptops and for getting to know Horovod.
 
docs/adasum_user_guide.rst

Lines changed: 3 additions & 3 deletions
@@ -77,7 +77,7 @@ Below are the requirements for running Horovod with AdaSum:
 
 *Using NCCL:*
 
-If the **HOROVOD_GPU_ALLREDUCE=NCCL** flag is used to compile Horovod, NCCL is used instead. In this case, NCCL will be used for intra-node communication, and AdaSum will be used for inter-node communication.
+If the **HOROVOD_GPU_OPERATIONS=NCCL** flag is used to compile Horovod, NCCL is used instead. In this case, NCCL will be used for intra-node communication, and AdaSum will be used for inter-node communication.
 
 Modes of Operation
 =====================
@@ -207,10 +207,10 @@ Key Takeaways
 
 - As the number of ranks scales up, the learning rate does not need to be scaled linearly if using CPU to do AdaSum reduction. A good scaling factor would be between 2\-2.5 over the best learning rate for a single worker.
 
-- If the HOROVOD_GPU_ALLREDUCE=NCCL flag is used to compile Horovod, the learning rate that should be used is equal to the best learning rate for a single worker (GPU) scaled by the number of GPUs locally on a node. On very large clusters, scaling this even more by another factor of 1.5\-2.0x might give better results but is not guaranteed and should be tried only if scaling by just the local size is not sufficient for good convergence.
+- If the HOROVOD_GPU_OPERATIONS=NCCL flag is used to compile Horovod, the learning rate that should be used is equal to the best learning rate for a single worker (GPU) scaled by the number of GPUs locally on a node. On very large clusters, scaling this even more by another factor of 1.5\-2.0x might give better results but is not guaranteed and should be tried only if scaling by just the local size is not sufficient for good convergence.
 
 - Pytorch training in fp16 format is not yet supported. Integration of Apex into the new optimizer to enable full mixed precision training with AdaSum in Pytorch is a work in progress.
 
-- When HOROVOD_GPU_ALLREDUCE=NCCL flag is used to compile Horovod and training is run on a single node, only averaging through NCCL library is used to perform reductions and no AdaSum algorithm will take place in this configuration.
+- When HOROVOD_GPU_OPERATIONS=NCCL flag is used to compile Horovod and training is run on a single node, only averaging through NCCL library is used to perform reductions and no AdaSum algorithm will take place in this configuration.
 
 .. inclusion-marker-end-do-not-remove

docs/gpus.rst

Lines changed: 4 additions & 4 deletions
@@ -45,20 +45,20 @@ by installing an `nv_peer_memory <https://github.com/Mellanox/nv_peer_memory>`__
 
 .. code-block:: bash
 
-    $ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install --no-cache-dir horovod
+    $ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
 
 If you installed NCCL 2 using the Ubuntu package, you can run:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
 If you installed NCCL 2 using the `CentOS / RHEL package <https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html#rhel_centos>`__, you can run:
 
 .. code-block:: bash
 
-    $ HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib64 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL pip install --no-cache-dir horovod
+    $ HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib64 HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
 
 **Note**: Some models with a high computation to communication ratio benefit from doing allreduce on CPU, even if a
@@ -87,7 +87,7 @@ configure Horovod to use them as well:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=MPI pip install --no-cache-dir horovod
 
 
 **Note**: Allgather allocates an output tensor which is proportionate to the number of processes participating in the
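
After any of these installs, one way to confirm which framework each GPU operation was actually built with is Horovod's build check; this assumes a Horovod version whose `horovodrun` supports the flag:

    $ horovodrun --check-build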

docs/install.rst

Lines changed: 5 additions & 4 deletions
@@ -152,8 +152,8 @@ will be used for CPU operations. You can override this by setting ``HOROVOD_CPU_OPERATIONS``.
 NCCL
 ~~~~
 
-NCCL is currently supported for Allreduce and Broadcast operations. You can enable these by setting
-``HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL`` during installation.
+NCCL is supported for Allreduce, Allgather, and Broadcast operations. You can enable these by setting
+``HOROVOD_GPU_OPERATIONS=NCCL`` during installation.
 
 NCCL operations are supported on both Nvidia (CUDA) and AMD (ROCm) GPUs. You can set ``HOROVOD_GPU`` in your
 environment to specify building with CUDA or ROCm. CUDA will be assumed if not specified.
@@ -223,8 +223,9 @@ Possible values are given in curly brackets: {}.
 * ``HOROVOD_WITH_MPI`` - {1}. Require that Horovod is built with MPI support enabled.
 * ``HOROVOD_WITHOUT_MPI`` - {1}. Skip building with MPI support.
 * ``HOROVOD_GPU`` - {CUDA, ROCM}. Framework to use for GPU operations.
-* ``HOROVOD_GPU_ALLREDUCE`` - {NCCL, MPI, DDL}. Framework to use for GPU tensor allreduce.
-* ``HOROVOD_GPU_ALLGATHER`` - {MPI}. Framework to use for GPU tensor allgather.
+* ``HOROVOD_GPU_OPERATIONS`` - {NCCL, MPI}. Framework to use for GPU tensor allreduce, allgather, and broadcast.
+* ``HOROVOD_GPU_ALLREDUCE`` - {NCCL, MPI}. Framework to use for GPU tensor allreduce.
+* ``HOROVOD_GPU_ALLGATHER`` - {NCCL, MPI}. Framework to use for GPU tensor allgather.
 * ``HOROVOD_GPU_BROADCAST`` - {NCCL, MPI}. Framework to use for GPU tensor broadcast.
 * ``HOROVOD_ALLOW_MIXED_GPU_IMPL`` - {1}. Allow Horovod to install with NCCL allreduce and MPI GPU allgather / broadcast. Not recommended due to a possible deadlock.
 * ``HOROVOD_CPU_OPERATIONS`` - {MPI, GLOO, CCL}. Framework to use for CPU tensor allreduce, allgather, and broadcast.
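
The per-operation variables remain available alongside the new umbrella variable, so a mixed build is still expressible. A sketch of the NCCL-allreduce / MPI-allgather-broadcast combination named under ``HOROVOD_ALLOW_MIXED_GPU_IMPL``; that the per-operation variables take precedence over ``HOROVOD_GPU_OPERATIONS`` is an assumption here, not something this diff states:

    # assumption: per-operation variables override HOROVOD_GPU_OPERATIONS
    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI \
        HOROVOD_ALLOW_MIXED_GPU_IMPL=1 pip install --no-cache-dir horovod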

docs/summary.rst

Lines changed: 1 addition & 4 deletions
@@ -52,9 +52,6 @@ about who's involved and how Horovod plays a role, read the LF AI `announcement
 
 .. contents::
 
-
-The full documentation and an API reference are published at https://horovod.readthedocs.io/en/latest.
-
 |
 
 Why Horovod?
@@ -118,7 +115,7 @@ To install Horovod:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
 
 This basic installation is good for laptops and for getting to know Horovod.
 

docs/summary.rst.patch

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+57,64d56
+< Documentation
+< -------------
+<
+< - `Latest Release <https://horovod.readthedocs.io/en/stable>`_
+< - `master <https://horovod.readthedocs.io/en/latest>`_
+<
+< |
+<

docs/troubleshooting.rst

Lines changed: 8 additions & 8 deletions
@@ -56,7 +56,7 @@ To use CUDA stub drivers:
     $ ldconfig /usr/local/cuda/lib64/stubs
 
     # install Horovod, add other HOROVOD_* environment variables as necessary
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
     # revert to standard libraries
     $ ldconfig
@@ -90,7 +90,7 @@ To use custom MPI directory:
 .. code-block:: bash
 
     $ export PATH=$PATH:/path/to/mpi/bin
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 2. Are MPI libraries added to ``$LD_LIBRARY_PATH`` or ``ld.so.conf``?
@@ -202,14 +202,14 @@ For example:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 Or:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_INCLUDE=/path/to/nccl/include HOROVOD_NCCL_LIB=/path/to/nccl/lib pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_INCLUDE=/path/to/nccl/include HOROVOD_NCCL_LIB=/path/to/nccl/lib pip install --no-cache-dir horovod
 
 
 Pip install: no such option: --no-cache-dir
@@ -237,7 +237,7 @@ For example:
 
 .. code-block:: bash
 
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 ncclAllReduce failed: invalid data type
@@ -260,7 +260,7 @@ the package and reinstall Horovod:
 
     $ conda remove nccl
    $ pip uninstall -y horovod
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
 
 
 transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error
@@ -322,15 +322,15 @@ To build Horovod with a specific CUDA version, use the ``HOROVOD_CUDA_HOME`` environment variable:
 .. code-block:: bash
 
     $ pip uninstall -y horovod
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_HOME=/path/to/cuda pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_HOME=/path/to/cuda pip install --no-cache-dir horovod
 
 
 Alternatively, you can use the ``HOROVOD_CUDA_INCLUDE`` and ``HOROVOD_CUDA_LIB`` environment variables to specify the CUDA library to use:
 
 .. code-block:: bash
 
     $ pip uninstall -y horovod
-    $ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_INCLUDE=/path/to/cuda/include HOROVOD_CUDA_LIB=/path/to/cuda/lib64 pip install --no-cache-dir horovod
+    $ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl HOROVOD_CUDA_INCLUDE=/path/to/cuda/include HOROVOD_CUDA_LIB=/path/to/cuda/lib64 pip install --no-cache-dir horovod
 
 
 FORCE-TERMINATE AT Data unpack would read past end of buffer
