
Commit 107d740

Added developer docs to help onboard new contributors (horovod#1190)
1 parent 6645749 commit 107d740

3 files changed: +182 −0 lines

docs/contributors.rst

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@

.. inclusion-marker-start-do-not-remove


Contributor Guide
=================

This guide covers the process of contributing to Horovod as a developer.

Environment Setup
~~~~~~~~~~~~~~~~~

Clone the repository locally:

.. code-block:: bash

    $ git clone --recursive https://github.com/horovod/horovod.git

Develop within a virtual environment to avoid dependency issues:

.. code-block:: bash

    $ virtualenv env
    $ . env/bin/activate

We recommend installing package versions that match those under test in
`Buildkite <https://github.com/horovod/horovod/blob/master/.buildkite/gen-pipeline.sh>`__.
For example:

.. code-block:: bash

    $ pip install tensorflow==1.14.0
    $ pip install keras==2.2.4
    $ pip install torch==1.1.0 torchvision
    $ pip install pytest
    $ pip install h5py future scipy mpi4py pyspark mxnet

Build and Install
~~~~~~~~~~~~~~~~~

First, uninstall any existing version of Horovod. Be sure to do this *outside* the Horovod root directory:

.. code-block:: bash

    $ cd $HOME
    $ pip uninstall -y horovod
    $ cd -

From *inside* the Horovod root directory, remove any previous build artifacts and then install Horovod:

.. code-block:: bash

    $ rm -rf build/ dist/
    $ HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 python setup.py install

Set ``HOROVOD_WITHOUT_[FRAMEWORK]=1`` to disable building Horovod plugins for that framework.
This is useful when you’re testing a feature of one framework in particular and wish to save time.
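
For example, a rebuild that skips the TensorFlow plugin while iterating on a PyTorch-only change might look like
this (the flag name follows the ``HOROVOD_WITHOUT_[FRAMEWORK]`` pattern above):

.. code-block:: bash

    $ rm -rf build/ dist/
    $ HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 python setup.py install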

Testing
~~~~~~~

Horovod has unit tests for all frameworks, which you can run from the ``test`` directory:

.. code-block:: bash

    $ cd test
    $ mpirun -np 2 pytest -v
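
To iterate on a single framework, you can point ``pytest`` at an individual test file instead of running the full
suite (the file name below assumes the current layout of the ``test`` directory):

.. code-block:: bash

    $ mpirun -np 2 pytest -v test_torch.py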

**Note:** You will need PySpark and Java to run the Spark tests.

**IMPORTANT:** Some tests contain GPU-only codepaths that will be skipped if running without GPU support.

Adding Custom Operations
~~~~~~~~~~~~~~~~~~~~~~~~

Operations in Horovod are used to transform Tensors across workers. Horovod currently supports operations that
implement Broadcast, Allreduce, and Allgather interfaces. Gradients in Horovod are aggregated through
Allreduce operations (with the exception of sparse gradients, which use Allgather).

All data transfer operations are implemented in the
`horovod/common/ops <https://github.com/horovod/horovod/tree/master/horovod/common/ops>`__ directory. Implementations
are organized by the collective communication library used to perform the operation (e.g.,
`mpi_operations.cc <https://github.com/horovod/horovod/blob/master/horovod/common/ops/mpi_operations.cc>`__ for MPI).

To create a new custom operation, start by defining a new class that inherits from the base operation, in the file
corresponding to the library you'll use to implement the operation:

.. code-block:: c++

    class CustomAllreduce : public AllreduceOp {
    public:
      CustomAllreduce(MPIContext* mpi_context, HorovodGlobalState* global_state);

      virtual ~CustomAllreduce() = default;

      Status Execute(std::vector<TensorTableEntry>& entries, const Response& response) override;

      bool Enabled(const ParameterManager& param_manager,
                   const std::vector<TensorTableEntry>& entries,
                   const Response& response) const override;
    };

The ``Execute`` member function is responsible for performing the operation on a list of Tensors. The ``entries``
parameter provides access to all the Tensor buffers and metadata that need to be processed,
and the ``response`` parameter contains additional metadata including which devices are being used by different ranks.

``Enabled`` should return true if your operation can be performed on the given Tensor entries subject to the
current parameter settings and response metadata.
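
To make that concrete, here is a minimal sketch of what the two member functions might look like. It assumes a single,
already-fused entry, delegates the reduction to ``MPI_Allreduce``, and borrows the buffer and communicator accessors
used in ``mpi_operations.cc``; a real implementation also needs the fusion-buffer handling found there, and the
runtime flag gating ``Enabled`` below is hypothetical:

.. code-block:: c++

    Status CustomAllreduce::Execute(std::vector<TensorTableEntry>& entries,
                                    const Response& response) {
      // Sketch: assumes exactly one entry; production code iterates and fuses.
      auto& e = entries[0];

      // Reduce the input buffer into the output buffer across all ranks.
      int op = MPI_Allreduce(e.tensor->data(), (void*) e.output->data(),
                             (int) e.tensor->shape().num_elements(),
                             mpi_context_->GetMPIDataType(e.tensor), MPI_SUM,
                             mpi_context_->GetMPICommunicator(Communicator::GLOBAL));
      return op == MPI_SUCCESS ? Status::OK()
                               : Status::UnknownError("MPI_Allreduce failed.");
    }

    bool CustomAllreduce::Enabled(const ParameterManager& param_manager,
                                  const std::vector<TensorTableEntry>& entries,
                                  const Response& response) const {
      // Sketch: gate the operation on a hypothetical runtime flag.
      return global_state_->custom_allreduce_enabled;
    }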

Once you've written the implementation for your operation, add it to the ``OperationManager`` in the
``CreateOperationManager`` function of
`operations.cc <https://github.com/horovod/horovod/blob/master/horovod/common/operations.cc>`__. Because more than one
operation may be *enabled* at a time, but only one will be performed on a given vector of Tensor entries, consider the
order of your operation in the ``OperationManager`` vector before adding it in.

The first operations in the vector will be checked before those at the end, and the first operation that is *enabled*
will be performed. Broadly, the order of operations should be:

1. Custom operations that trigger based on parameters configured at runtime (e.g., ``NCCLHierarchicalAllreduce``).
2. Accelerated operations that take advantage of specialized hardware where available (e.g., ``NCCLAllreduce``).
3. Default operations that can run using standard CPUs and host memory (e.g., ``MPIAllreduce``).

Most custom operations that require preconditions such as runtime flags will fall into the first category.
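
The registration itself is a single line. As a sketch, assuming the ``allreduce_ops`` vector and the ``mpi_context``
and ``state`` objects as they currently appear in ``CreateOperationManager``:

.. code-block:: c++

    // In CreateOperationManager (operations.cc): push the custom operation
    // ahead of the defaults so its Enabled() check is consulted first.
    allreduce_ops.push_back(std::shared_ptr<AllreduceOp>(
        new CustomAllreduce(&mpi_context, &state)));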

Adding Compression Algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gradient compression is used to reduce the amount of data sent over the network during an Allreduce operation. Such
compression algorithms are implemented per framework (TensorFlow, PyTorch, MXNet, etc.) in
``horovod/[framework]/compression.py``
(see: `TensorFlow <https://github.com/horovod/horovod/blob/master/horovod/tensorflow/compression.py>`__,
`PyTorch <https://github.com/horovod/horovod/blob/master/horovod/torch/compression.py>`__).

To implement a new compression algorithm, first add a new class inheriting from ``Compressor``:

.. code-block:: python

    class CustomCompressor(Compressor):
        @staticmethod
        def compress(tensor):
            # do something here ...
            return tensor_compressed, ctx

        @staticmethod
        def decompress(tensor, ctx):
            # do something here ...
            return tensor_decompressed

The ``compress`` method takes a Tensor gradient and returns it in its compressed form, along with any additional context
necessary to decompress the tensor back to its original form. Similarly, ``decompress`` takes in a compressed tensor
with its context and returns a decompressed tensor. Compression can be done in pure Python, or in C++ using a custom
op (e.g., in `mpi_ops.cc <https://github.com/horovod/horovod/blob/master/horovod/tensorflow/mpi_ops.cc>`__ for
TensorFlow).
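
For a concrete reference point, the built-in FP16 compressor in the TensorFlow backend follows this pattern, using the
tensor's original dtype as the decompression context (shown roughly as it appears in
``horovod/tensorflow/compression.py``, where ``tf`` is the imported TensorFlow module):

.. code-block:: python

    class FP16Compressor(Compressor):
        """Compress all floating point gradients to 16-bit."""
        @staticmethod
        def compress(tensor):
            tensor_compressed = tensor
            if tensor.dtype.is_floating:
                # Only compress tensors that are already a floating point type.
                tensor_compressed = tf.cast(tensor, dtype=tf.float16)
            return tensor_compressed, tensor.dtype

        @staticmethod
        def decompress(tensor, ctx):
            tensor_decompressed = tensor
            if ctx.is_floating:
                # Cast back to the dtype recorded at compression time.
                tensor_decompressed = tf.cast(tensor, dtype=ctx)
            return tensor_decompressed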

Once implemented, add your ``Compressor`` subclass to the ``Compression`` class, which emulates an enumeration API:

.. code-block:: python

    class Compression(object):
        # ...

        custom = CustomCompressor

Finally, you can start using your new compressor by passing it to the ``DistributedOptimizer``:

.. code-block:: python

    opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.custom)


.. inclusion-marker-end-do-not-remove

docs/contributors_include.rst

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@

.. include:: ./contributors.rst
   :start-after: inclusion-marker-start-do-not-remove
   :end-before: inclusion-marker-end-do-not-remove

docs/index.rst

Lines changed: 3 additions & 0 deletions
@@ -111,6 +111,9 @@ Guides
 
    troubleshooting_include
 
+   contributors_include
+
+
 
 Indices and tables
 ------------------
