Commit 5370ebc

Added support for Gloo on macOS (horovod#2254)
Signed-off-by: Travis Addair <[email protected]>
1 parent 3dc1ade commit 5370ebc

9 files changed, 59 additions and 58 deletions

CMakeLists.txt

Lines changed: 4 additions & 4 deletions
@@ -213,12 +213,15 @@ message(FATAL_ERROR "You should not mix NCCL and MPI GPU due to a possible deadl
 endif()
 
 # Gloo
-if (NOT "$ENV{HOROVOD_WITHOUT_GLOO}" STREQUAL "1" AND NOT ${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+if (NOT "$ENV{HOROVOD_WITHOUT_GLOO}" STREQUAL "1")
 if(HAVE_MPI)
 set(USE_MPI TRUE)
 else()
 set(USE_MPI FALSE)
 endif()
+if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+set(USE_LIBUV_DEFAULT ON)
+endif()
 set(CMAKE_POLICY_DEFAULT_CMP0074 NEW)
 add_subdirectory(third_party/gloo)
 include_directories(third_party/gloo)
@@ -231,9 +234,6 @@ if (NOT "$ENV{HOROVOD_WITHOUT_GLOO}" STREQUAL "1" AND NOT ${CMAKE_SYSTEM_NAME} M
 add_definitions(-DHAVE_GLOO=1)
 set(HAVE_GLOO TRUE)
 endif()
-if (NOT HAVE_MPI AND ${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
-message(FATAL_ERROR "Gloo cannot be compiled on MacOS, install MPI.")
-endif()
 
 # NCCL + MPI
 if (HAVE_NCCL AND HAVE_MPI)
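
The CMake change above can be read as a small decision rule: Gloo is now built whenever `HOROVOD_WITHOUT_GLOO` is not `1`, and on Darwin the build additionally switches `USE_LIBUV_DEFAULT` on. The following is only an illustrative model of that rule in C++, based on the lines visible in this hunk; `GlooPlan` and `PlanGloo` are hypothetical names and do not exist in Horovod.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of what the post-commit CMake branch decides.
struct GlooPlan {
  bool build_gloo;         // third_party/gloo is added to the build
  bool use_libuv_default;  // USE_LIBUV_DEFAULT is set to ON
};

// `without_gloo` models the HOROVOD_WITHOUT_GLOO environment variable and
// `system_name` models CMAKE_SYSTEM_NAME (both are assumptions of this sketch).
inline GlooPlan PlanGloo(const std::string& without_gloo,
                         const std::string& system_name) {
  GlooPlan plan{false, false};
  if (without_gloo != "1") {
    plan.build_gloo = true;
    // CMake's MATCHES is a regex search; for the literal "Darwin" a
    // substring test behaves the same way.
    plan.use_libuv_default =
        system_name.find("Darwin") != std::string::npos;
  }
  return plan;
}

// Small self-check showing the rule for macOS.
inline void DemoPlanGloo() {
  const GlooPlan mac = PlanGloo("", "Darwin");
  assert(mac.build_gloo && mac.use_libuv_default);
}
```

Before this commit the same inputs (`""`, `"Darwin"`) would have skipped Gloo entirely and, without MPI, stopped the build with a fatal error.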

Dockerfile.gpu

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ RUN pip install tensorflow==${TENSORFLOW_VERSION} \
 keras \
 h5py
 
-# https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl
 RUN PYTAGS=$(python -c "from packaging import tags; tag = list(tags.sys_tags())[0]; print(f'{tag.interpreter}-{tag.abi}')") && \
 pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl \
 https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl

README.rst

Lines changed: 5 additions & 18 deletions
@@ -107,21 +107,11 @@ To install Horovod:
 
 1. Install `CMake <https://cmake.org/install/>`__
 
-2. *Optional*: Install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation.
-
-Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.
-
-**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
-
-**Note (Linux)**: Linux users can use `Gloo <https://github.com/facebookincubator/gloo>`__ as an alternative to MPI, which requires no extra dependencies.
-
-**Note (macOS)**: MPI is required for Horovod on macOS, as Gloo is currently unavailable.
-
 .. raw:: html
 
 <p/>
 
-3. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.
+2. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.
 
 If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.
 
@@ -131,7 +121,7 @@ To install Horovod:
 
 <p/>
 
-4. Install the ``horovod`` pip package.
+3. Install the ``horovod`` pip package.
 
 To run on CPUs:
 
@@ -145,12 +135,12 @@ To install Horovod:
 
 $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
 
-This basic installation is good for laptops and for getting to know Horovod.
-
 For more details on installing Horovod with GPU support, read `Horovod on GPU <docs/gpus.rst>`_.
 
 For the full list of Horovod installation options, read the `Installation Guide <docs/install.rst>`_.
 
+If you want to use MPI, read `Horovod with MPI <docs/mpi.rst>`_.
+
 If you want to use Conda, read `Building a Conda environment with GPU support for Horovod <docs/conda.rst>`_.
 
 If you want to use Docker, read `Horovod in Docker <docs/docker.rst>`_.
@@ -306,17 +296,14 @@ Gloo
 ----
 `Gloo <https://github.com/facebookincubator/gloo>`_ is an open source collective communications library developed by Facebook.
 
-Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed. Gloo support only requires
-that you have `CMake <https://cmake.org/>`_ installed, and is only supported on Linux at this time.
+Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.
 
 For environments that have support both MPI and Gloo, you can choose to use Gloo at runtime by passing the ``--gloo`` argument to ``horovodrun``:
 
 .. code-block:: bash
 
 $ horovodrun --gloo -np 2 python train.py
 
-Gloo support is still early in its development, and more features are coming soon.
-
 mpi4py
 ------
 Horovod supports mixing and matching Horovod collectives with other MPI libraries, such as `mpi4py <https://mpi4py.scipy.org>`_,

build-docker-images.sh

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ function build_one()
 docker build -f Dockerfile.${device} -t ${tag} --build-arg python=${py} --no-cache .
 horovod_version=$(docker run --rm ${tag} pip show horovod | grep Version | awk '{print $2}')
 tensorflow_version=$(docker run --rm ${tag} pip show ${tensorflow_pkg} | grep Version | awk '{print $2}')
-pytorch_version=$(docker run --rm ${tag} pip show torch | grep Version | awk '{print $2}')
+pytorch_version=$(docker run --rm ${tag} pip show torch | grep Version | sed 's/+/ /g' | awk '{print $2}')
 mxnet_version=$(docker run --rm ${tag} pip show ${mxnet_pkg} | grep Version | awk '{print $2}')
 final_tag=horovod/horovod:${horovod_version}-tf${tensorflow_version}-torch${pytorch_version}-mxnet${mxnet_version}-py${py}-${device}
 docker tag ${tag} ${final_tag}
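
The added `sed 's/+/ /g'` step turns a version such as `1.6.0+cu101` into two fields, so `awk '{print $2}'` returns only `1.6.0`. The commit does not state the motivation; presumably it is that `+` is not a valid character in a Docker image tag. As a rough sketch of the same text transformation in C++ (the function name `SecondFieldWithoutLocalVersion` is made up for this illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>

// Mirrors `sed 's/+/ /g' | awk '{print $2}'` for a single line of text:
// replace every '+' with a space, then return the second
// whitespace-separated field (empty if there are fewer than two fields).
inline std::string SecondFieldWithoutLocalVersion(std::string line) {
  std::replace(line.begin(), line.end(), '+', ' ');
  std::istringstream fields(line);
  std::string first;
  std::string second;
  fields >> first >> second;
  return second;
}

// Small self-check using the kind of line `pip show torch` prints.
inline void DemoSecondField() {
  assert(SecondFieldWithoutLocalVersion("Version: 1.6.0+cu101") == "1.6.0");
}
```

Versions without a local suffix pass through unchanged, which is why the extra `sed` step is harmless for the other packages' version lines.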

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -117,6 +117,8 @@ Guides
 
 gpus_include
 
+mpi_include
+
 conda_include
 
 docker_include

docs/mpirun.rst renamed to docs/mpi.rst

Lines changed: 20 additions & 7 deletions
@@ -1,7 +1,18 @@
-:orphan:
+.. inclusion-marker-start-do-not-remove
+
+Horovod with MPI
+================
+
+MPI can be used as an alternative to Gloo for coordinating work between processes in Horovod. When using NCCL, performance
+will be similar between the two, but if you are doing CPU training, there are noticeable performance benefits to using MPI.
+
+First install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation. Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.
+
+**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
+
+mpirun
+------
 
-Run Horovod with Open MPI
-=========================
 ``horovodrun`` introduces a convenient, Open MPI-based wrapper for running Horovod scripts.
 
 In some cases it is desirable to have fine-grained control over options passed to Open MPI. This page describes
@@ -56,7 +67,7 @@ With the ``-x`` option you can specify (``-x NCCL_DEBUG=INFO``) or copy (``-x LD
 all the workers.
 
 Custom SSH ports
-----------------
+~~~~~~~~~~~~~~~~
 
 Specify custom SSH ports with ``-mca plm_rsh_args "-p <port>"`` as follows:
 
@@ -73,7 +84,7 @@ Specify custom SSH ports with ``-mca plm_rsh_args "-p <port>"`` as follows:
 This is frequently useful in the case of `running Horovod in Docker environment <docker.rst>`_.
 
 Open MPI with RDMA
-------------------
+~~~~~~~~~~~~~~~~~~
 
 As noted above, using TCP for MPI communication does not have any significant effects on performance in the majority of
 cases. Models that make heavy use of ``hvd.broadcast()`` and ``hvd.allgather()`` operations are exceptions to that rule.
@@ -95,7 +106,7 @@ Other MPI RDMA implementations may or may not benefit from disabling multithread
 documentation.
 
 Horovod Parameter Knobs
------------------------
+~~~~~~~~~~~~~~~~~~~~~~~
 
 Many of the configurable parameters available as command line arguments to ``horovodrun`` can be used with ``mpirun``
 through the use of environment variables.
@@ -121,7 +132,7 @@ Autotuning:
 Note that when using ``horovodrun``, any command line arguments will override values set in the environment.
 
 Hangs due to non-routed network interfaces
-------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Having network interfaces that are not routed can cause Open MPI to hang. An example of such interface is ``docker0``.
 
@@ -177,3 +188,5 @@ Example ``mpirun`` command with ``lo`` and ``docker0`` interfaces excluded:
 -mca pml ob1 -mca btl ^openib \
 -mca btl_tcp_if_exclude lo,docker0 \
 python train.py
+
+.. inclusion-marker-end-do-not-remove

docs/mpi_include.rst

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+.. include:: ./mpi.rst
+:start-after: inclusion-marker-start-do-not-remove
+:end-before: inclusion-marker-end-do-not-remove
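
The new `docs/mpi_include.rst` pulls in only the part of `docs/mpi.rst` that sits between the two marker comments. The `start-after` and `end-before` options of the reStructuredText `include` directive select the text after the first occurrence of one string and before the next occurrence of another. A minimal C++ sketch of that selection follows; `Between` is a hypothetical helper, and unlike Docutils (which reports an error) it simply returns an empty string when a marker is missing.

```cpp
#include <cassert>
#include <string>

// Returns the text strictly between the first occurrence of `start` and the
// first occurrence of `end` that follows it. Returns an empty string if
// either marker cannot be found (a simplification made by this sketch).
inline std::string Between(const std::string& text,
                           const std::string& start,
                           const std::string& end) {
  const std::string::size_type begin = text.find(start);
  if (begin == std::string::npos) return "";
  const std::string::size_type from = begin + start.size();
  const std::string::size_type stop = text.find(end, from);
  if (stop == std::string::npos) return "";
  return text.substr(from, stop - from);
}

// Small self-check with the marker strings used in this commit.
inline void DemoBetween() {
  const std::string doc =
      ".. inclusion-marker-start-do-not-remove\n"
      "Horovod with MPI\n"
      ".. inclusion-marker-end-do-not-remove\n";
  const std::string body = Between(doc,
                                   "inclusion-marker-start-do-not-remove",
                                   "inclusion-marker-end-do-not-remove");
  // The ".. " that introduces the end marker precedes the match, so it
  // stays in the selected text.
  assert(body == "\nHorovod with MPI\n.. ");
}
```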

docs/summary.rst

Lines changed: 5 additions & 18 deletions
@@ -99,21 +99,11 @@ To install Horovod:
 
 1. Install `CMake <https://cmake.org/install/>`__
 
-2. *Optional*: Install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation.
-
-Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.
-
-**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
-
-**Note (Linux)**: Linux users can use `Gloo <https://github.com/facebookincubator/gloo>`__ as an alternative to MPI, which requires no extra dependencies.
-
-**Note (macOS)**: MPI is required for Horovod on macOS, as Gloo is currently unavailable.
-
 .. raw:: html
 
 <p/>
 
-3. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.
+2. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.
 
 If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.
 
@@ -123,7 +113,7 @@ To install Horovod:
 
 <p/>
 
-4. Install the ``horovod`` pip package.
+3. Install the ``horovod`` pip package.
 
 To run on CPUs:
 
@@ -137,12 +127,12 @@ To install Horovod:
 
 $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
 
-This basic installation is good for laptops and for getting to know Horovod.
-
 For more details on installing Horovod with GPU support, read `Horovod on GPU <gpus.rst>`_.
 
 For the full list of Horovod installation options, read the `Installation Guide <install.rst>`_.
 
+If you want to use MPI, read `Horovod with MPI <mpi.rst>`_.
+
 If you want to use Conda, read `Building a Conda environment with GPU support for Horovod <conda.rst>`_.
 
 If you want to use Docker, read `Horovod in Docker <docker.rst>`_.
@@ -298,17 +288,14 @@ Gloo
 ----
 `Gloo <https://github.com/facebookincubator/gloo>`_ is an open source collective communications library developed by Facebook.
 
-Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed. Gloo support only requires
-that you have `CMake <https://cmake.org/>`_ installed, and is only supported on Linux at this time.
+Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.
 
 For environments that have support both MPI and Gloo, you can choose to use Gloo at runtime by passing the ``--gloo`` argument to ``horovodrun``:
 
 .. code-block:: bash
 
 $ horovodrun --gloo -np 2 python train.py
 
-Gloo support is still early in its development, and more features are coming soon.
-
 mpi4py
 ------
 Horovod supports mixing and matching Horovod collectives with other MPI libraries, such as `mpi4py <https://mpi4py.scipy.org>`_,

horovod/common/gloo/gloo_context.cc

Lines changed: 19 additions & 9 deletions
@@ -23,7 +23,17 @@
 #include "gloo/rendezvous/context.h"
 #include "gloo/rendezvous/file_store.h"
 #include "gloo/rendezvous/prefix_store.h"
+
+#ifdef __linux__
 #include "gloo/transport/tcp/device.h"
+using attr = gloo::transport::tcp::attr;
+constexpr auto CreateDevice = gloo::transport::tcp::CreateDevice;
+#else
+// Use uv on macOS as TCP requires epoll (Linux-only)
+#include "gloo/transport/uv/device.h"
+using attr = gloo::transport::uv::attr;
+constexpr auto CreateDevice = gloo::transport::uv::CreateDevice;
+#endif
 
 #if HAVE_MPI
 #include "gloo/mpi/context.h"
@@ -98,10 +108,10 @@ void GlooContext::InitializeFromMPI(MPIContext& mpi_ctx,
 
 // TODO(sihan): Add support for multiple interfaces:
 // https://github.com/facebookincubator/gloo/issues/190
-gloo::transport::tcp::attr attr;
-attr.iface = gloo_iface;
-attr.ai_family = AF_UNSPEC;
-auto dev = gloo::transport::tcp::CreateDevice(attr);
+attr device_attr;
+device_attr.iface = gloo_iface;
+device_attr.ai_family = AF_UNSPEC;
+auto dev = CreateDevice(device_attr);
 auto timeout = GetTimeoutFromEnv();
 
 auto context =
@@ -129,14 +139,14 @@ void GlooContext::Initialize(const std::string& gloo_iface) {
 return;
 }
 
-// Create a tcp device for communication
+// Create a device for communication
 // TODO(sihan): Add support for multiple interfaces:
 // https://github.com/facebookincubator/gloo/issues/190
-gloo::transport::tcp::attr attr;
-attr.iface = gloo_iface;
+attr device_attr;
+device_attr.iface = gloo_iface;
 
-attr.ai_family = AF_UNSPEC;
-auto dev = gloo::transport::tcp::CreateDevice(attr);
+device_attr.ai_family = AF_UNSPEC;
+auto dev = CreateDevice(device_attr);
 auto timeout = GetTimeoutFromEnv();
 
 auto host_env = std::getenv(HOROVOD_HOSTNAME);
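
The change to `gloo_context.cc` picks the transport once, at compile time, and exposes it under the neutral names `attr` and `CreateDevice`, so both call sites stay free of platform checks. The local variable is renamed to `device_attr`, presumably so it does not hide the new `attr` alias. The sketch below reproduces the pattern in a self-contained form. The `sketch::tcp` and `sketch::uv` namespaces, the `Device` struct, and `MakeDevice` are hypothetical stand-ins that only mimic the shape used in the diff; they are not the Gloo API.

```cpp
#include <cassert>
#include <memory>
#include <string>

namespace sketch {
// Stand-in for a transport device; records which backend created it.
struct Device {
  std::string backend;
  std::string iface;
};

namespace tcp {
struct attr {
  std::string iface;
  int ai_family = 0;
};
inline std::shared_ptr<Device> CreateDevice(const attr& a) {
  return std::make_shared<Device>(Device{"tcp", a.iface});
}
}  // namespace tcp

namespace uv {
struct attr {
  std::string iface;
  int ai_family = 0;
};
inline std::shared_ptr<Device> CreateDevice(const attr& a) {
  return std::make_shared<Device>(Device{"uv", a.iface});
}
}  // namespace uv
}  // namespace sketch

// Select the transport at compile time and give it neutral names.
#ifdef __linux__
using attr = sketch::tcp::attr;
constexpr auto CreateDevice = sketch::tcp::CreateDevice;
#else
using attr = sketch::uv::attr;
constexpr auto CreateDevice = sketch::uv::CreateDevice;
#endif

// The call site no longer names a specific transport.
inline std::shared_ptr<sketch::Device> MakeDevice(const std::string& iface) {
  attr device_attr;
  device_attr.iface = iface;
  return CreateDevice(device_attr);
}

// Small self-check showing the call site in use.
inline void DemoMakeDevice() {
  const auto dev = MakeDevice("lo");
  assert(dev->iface == "lo");
}
```

Using a type alias plus a `constexpr` function pointer keeps the selection in one place; the approach relies on both transports offering an attribute struct with the same field names, as the two stand-ins do here.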
