
Commit 78cf6be

Add userguide for AdaSum (horovod#1809)
Signed-off-by: Aishwarya Bhandare <[email protected]>
1 parent 65de4c9 commit 78cf6be

14 files changed: +379 -0 lines changed

docs/adasum_user_guide.rst

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
.. inclusion-marker-start-do-not-remove


AdaSum with Horovod
===================

Adaptive Summation, or AdaSum, is a novel algorithm for improving distributed data-parallel training of deep learning models. The improvement shows up in two ways: it can reduce the number of steps needed to reach a given accuracy, and it lets you scale to more training workers without penalizing the learning rate and convergence stability.

AdaSum can be used with Horovod and PyTorch/TensorFlow.

|

.. Contents::

|


Introduction to the AdaSum Algorithm
======================================


Scaling DNN training to many GPUs comes at the cost of convergence degradation. This is because with larger batch sizes, gradients are averaged and the learning rate per example is smaller. To address this, the learning rate is usually scaled up, but that can cause the model parameters to diverge. AdaSum addresses both issues without introducing any new hyperparameters.

Suppose there are two almost-parallel gradients from two different GPUs, g1 and g2, that need to be reduced as shown in the figure below. The two common practices for reduction are g1+g2, the gray vector, or (g1+g2)/2, the green vector. g1+g2 may cause the model to diverge, since it effectively moves in the direction of g1 or g2 by twice the magnitude of g1 or g2. Therefore, (g1+g2)/2 is generally the safer and more desirable choice. Note that (g1+g2)/2 penalizes both components g1 and g2 equally.

.. image:: media/abc4d31f19a315321553564e2225615b.png

Now consider the two orthogonal gradients g1 and g2 in the figure below. Since g1 and g2 lie in two different dimensions and are independent of each other, g1+g2 may not cause divergence.

.. image:: media/173cffdbdc89620287996ac28ca4a9ae.png

Finally, consider the third scenario, where g1 and g2 are neither parallel nor orthogonal, as shown in the figure below. In such a case, where taking the sum might cause divergence, AdaSum controls the effect of the overall gradient update by subtracting half of g1's projection on g2 (pink vector) from g2, subtracting half of g2's projection on g1 (orange vector) from g1, and summing the two components together.

.. image:: media/afa201b07a16fb29829dd8390aa0cc07.png

.. image:: media/d9b318cc2d8c16fe4ade2fa73ad83ec6.png

This formula reduces to a sum when g1 and g2 are orthogonal and to an average when g1 and g2 are parallel.

This idea extends to many gradients as well. Suppose there are 2^n gradients coming from 2^n different GPUs. AdaSum inductively takes pairs of gradients and reduces them using the method above until all of them are reduced into one gradient. Thus, the current implementation of AdaSum requires the number of nodes to be a power of 2.
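
For intuition, here is a minimal NumPy sketch of the pairwise rule described above. This is only an illustration of the math, not Horovod's actual implementation, which performs the reduction inside its fused allreduce:

.. code-block:: python

    import numpy as np

    def adasum_pair(g1, g2, eps=1e-12):
        """Pairwise AdaSum: subtract half of each gradient's projection onto
        the other, then add. Yields g1 + g2 for orthogonal gradients and
        (g1 + g2) / 2 for parallel ones."""
        dot = np.dot(g1, g2)
        return ((1.0 - dot / (2.0 * np.dot(g1, g1) + eps)) * g1 +
                (1.0 - dot / (2.0 * np.dot(g2, g2) + eps)) * g2)

    def adasum_reduce(grads):
        """Inductively reduce a list of 2**n gradients pair by pair."""
        while len(grads) > 1:
            grads = [adasum_pair(grads[i], grads[i + 1])
                     for i in range(0, len(grads), 2)]
        return grads[0]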

The Distributed Optimizer for AdaSum
======================================


AdaSum uses the distributed AdaSum optimizer to update the weights of the model after each step. In the usual data-parallel training scenario, the gradients are computed independently by backpropagation on all the nodes, reduced (by averaging) so that all the nodes have the same gradients, and then used to update the weights of the model.

The distributed optimizer for AdaSum first obtains the local gradients from backpropagation on the current local mini-batch. Instead of performing the reduction at this point, it applies the optimization function to the local gradients to perform the weight update. It then computes the delta, i.e. the difference between the weights before and after the update, and reduces this delta instead of the gradients. Once all the workers have the same delta, the weights are updated to the sum of the initial weights and the delta.

Since the nature of AdaSum requires it to operate on the full magnitude of the gradient, the newly added distributed optimizer uses this difference between the weights before and after the optimizer step to deliver a more accurate estimate.
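
Schematically, the delta-based update described above looks like the following sketch. This is a simplified PyTorch-style illustration, not Horovod's internal code; ``allreduce_adasum`` stands in for a collective that reduces tensors across workers with the AdaSum operation:

.. code-block:: python

    import torch

    def adasum_optimizer_step(model, optimizer, allreduce_adasum):
        # 1. Remember the weights before the local update.
        start = [p.detach().clone() for p in model.parameters()]

        # 2. Apply the local optimizer to the local gradients.
        optimizer.step()

        with torch.no_grad():
            for p, p0 in zip(model.parameters(), start):
                # 3. Reduce the local delta (new weights - old weights) with AdaSum.
                delta = allreduce_adasum(p - p0)
                # 4. Final weights = initial weights + reduced delta.
                p.copy_(p0 + delta)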

Installation and Usage Instructions
=====================================


AdaSum can be used and experimented with in Horovod with either PyTorch or TensorFlow.

In addition, there are two options for using AdaSum with Horovod: with the Message Passing Interface (MPI) and with `NCCL <https://developer.nvidia.com/nccl>`_.
Any valid implementation of MPI can be used, but AdaSum has been tested with `OpenMPI <https://www.open-mpi.org/>`_ and `IntelMPI <https://software.intel.com/en-us/mpi-library>`_.

Setting up the environment
--------------------------

Below are the requirements for running Horovod with AdaSum:

- CUDA >= 6.0

- OpenMPI >= 3.0

- NCCL >= 2.0

- PyTorch >= 1.2.0 OR

- TensorFlow >= 1.11.0, < 2.0

- Horovod >= 0.18.2

*Using NCCL:*

If the **HOROVOD_GPU_ALLREDUCE=NCCL** flag is used to compile Horovod, NCCL is used instead of MPI: NCCL handles intra-node communication, and AdaSum handles inter-node communication.
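
Before launching a training job, the build configuration can be checked from Python. A small sketch, assuming a Horovod version that exposes these helper functions:

.. code-block:: python

    import horovod.torch as hvd   # or horovod.tensorflow as hvd

    # True if Horovod was compiled with MPI / NCCL support, respectively.
    print("MPI built: ", hvd.mpi_built())
    print("NCCL built:", hvd.nccl_built())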

Modes of Operation
=====================

AdaSum can be used in the following ways, depending on the hardware setup available.

Pure CPU
--------------------------

When dealing with a hardware setup of multiple nodes, where each node has worker GPUs that are not connected by a high-speed interconnect such as `NVLink <https://www.nvidia.com/en-us/data-center/nvlink/>`_ and communication happens through the CPU, AdaSum through MPI can be used for both intra-node and inter-node communication. In this case, all of the AdaSum ops are performed on the CPU.

If the hardware setup allows a different mode, such as Ring or Hierarchical, to be used, that mode should be used instead to get the highest performance benefit.

.. image:: media/7220c70747b40ab58fce2dc246958218.png

Ring
--------------------------

On specifically configured machines (`DGX1 <https://www.nvidia.com/en-us/data-center/dgx-1/>`_ nodes with 8 GPUs each), the Ring mode can be used instead of the pure CPU mode. This mode is identical to the pure CPU mode for inter-node communication, but is able to do intra-node communication without going through the CPU. It does this by utilizing CUDA-aware MPI (OpenMPI built with `UCX <https://www.openucx.org/>`_ support) to allow direct GPU-to-GPU communication within nodes. This gives the same convergence benefits as the pure CPU mode, but much better throughput on nodes that support it.

Ring mode is currently supported only on **DGX1** nodes with 8 GPUs each.

.. image:: media/4920a765a77fa6eeca28c5aceaa405ec.png

Hierarchical
--------------------------

In cases where the hardware does not support Ring mode, but throughput higher than that of the pure CPU mode is desired, the Hierarchical mode can be used instead.

The Hierarchical mode functions similarly to the Ring mode, except that it uses NCCL to do regular averaging intra-node instead of using CUDA-aware MPI to do an AdaSum-like ring. Note that Hierarchical also works on any hardware configuration and is not limited to DGX1s.

In practice, Hierarchical yields the best throughput, but it lowers the convergence benefits of AdaSum because some of the ops are regular averaging. As a rule of thumb, the degradation in convergence benefit is typically insignificant on clusters with a large number of nodes (>=8), since in that case enough inter-node AdaSum ops are still performed. This is the ideal Hierarchical scenario.

The other reason to use Hierarchical even on smaller clusters is when Ring mode is not supported and CPU mode throughput is simply too low to be viable. Note that in these cases the convergence benefits compared to not using AdaSum at all might be minor.

The learning rate that should be used is equal to the best learning rate for a single worker (GPU) scaled by the number of GPUs on a node. On very large clusters, scaling this by another factor of 1.5-2.0x might give better results, but this is not guaranteed and should be tried only if scaling by just the local size is not sufficient for good convergence.

.. image:: media/a254d38d0e56319c0507a16ea09df959.png
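
For example, a minimal sketch of this learning-rate scaling in PyTorch (the base learning rate of 0.0125 is a hypothetical single-GPU value, not a recommendation):

.. code-block:: python

    import horovod.torch as hvd

    hvd.init()

    base_lr = 0.0125                  # best learning rate found for a single GPU (assumed)
    lr = base_lr * hvd.local_size()   # scale by the number of GPUs on one node
    # On very large clusters, an additional 1.5-2.0x factor can be tried if needed.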

Modifications to the code
===========================

A new distributed optimizer has been added to both TensorFlow and PyTorch to support the AdaSum algorithm.

An optional parameter **op** has been added to the DistributedOptimizer and allreduce APIs so that users can specify which operation to perform.
When **op=hvd.AdaSum** is specified, the new optimizer will be used.

AdaSum is highly effective in scaling to large batch sizes. The **backward_passes_per_step** parameter of the DistributedOptimizer can be used for gradient accumulation in order to scale to larger effective batch sizes without being limited by GPU memory.

TensorFlow
--------------------------

- DistributedOptimizer

.. code-block:: python

    opt = tf.train.AdamOptimizer(0.001)
    opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=5, op=hvd.AdaSum)

- Allreduce

.. code-block:: python

    hvd.allreduce(tensor, op=hvd.AdaSum)

PyTorch
--------------------------

- DistributedOptimizer

.. code-block:: python

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=compression, backward_passes_per_step=5, op=hvd.AdaSum)

- Allreduce

.. code-block:: python

    hvd.allreduce(tensor, op=hvd.AdaSum)
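
Putting the pieces together, a minimal end-to-end PyTorch setup might look like the sketch below. The model, data, and hyperparameters are placeholders chosen for illustration; only the Horovod calls follow the API shown in this guide:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())      # pin each worker to one GPU

    model = nn.Linear(784, 10).cuda()            # placeholder model
    # Learning rate scaled by the local GPU count, per the guidance above.
    optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.local_size(), momentum=0.9)

    # Wrap the optimizer so that updates are reduced with AdaSum.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters(),
                                         op=hvd.AdaSum)

    # Make sure all workers start from the same state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Placeholder data: random batches standing in for a real data loader.
    data = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(100)]

    loss_fn = nn.CrossEntropyLoss()
    for inputs, targets in data:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()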

Case Studies
==============


Square and Cubic optimization
---------------------------------

**A simple case study to understand AdaSum's behavior**

To understand the behavior and potential benefits of AdaSum as compared to averaging, consider a simple experiment in square (degree-2) optimization using AdaSum. The goal is to estimate the coefficients of a polynomial of degree 2. The features are generated by randomly sampling a uniform distribution and scaling by a factor x_max, which can be specified; this sets the complexity of the data used to estimate the coefficients. The learning rate and the op used for Allreduce can be specified as well. The true label is calculated with the original true coefficients, without adding any noise.

To estimate the coefficients, Stochastic Gradient Descent is used. The training is stopped once the gradients are zero for two consecutive runs. This optimization can be run over a range of learning rates, numbers of workers, and data ranges (set by x_max). It can also be modified into a cubic optimization problem.

This experiment can be run through the Jupyter notebook `adasum_bench.ipynb <../examples/adasum_bench.ipynb>`_, with the models defined in `adasum_small_model.py <../examples/adasum_small_model.py>`_.
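
A rough sketch of the setup described above (not the notebook's exact code; x_max, the learning rate, and the true coefficients are hypothetical values chosen for illustration):

.. code-block:: python

    import torch

    # Hypothetical settings for illustration.
    x_max, lr, num_steps = 1.0, 0.05, 2000
    true_coeffs = torch.tensor([2.0, -3.0, 0.5])    # degree-2 polynomial: 2x^2 - 3x + 0.5

    x = torch.rand(1024) * x_max                    # uniform features scaled by x_max
    features = torch.stack([x ** 2, x, torch.ones_like(x)], dim=1)
    labels = features @ true_coeffs                 # noise-free labels

    coeffs = torch.zeros(3, requires_grad=True)     # coefficients to estimate
    optimizer = torch.optim.SGD([coeffs], lr=lr)

    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = torch.mean((features @ coeffs - labels) ** 2)
        loss.backward()
        optimizer.step()
    # In the distributed version, each worker holds its own sample of the data and the
    # updates are combined with hvd.allreduce(..., op=hvd.AdaSum) or with averaging.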

Running experiments with different numbers of workers, we can draw the following conclusions for this simple scenario with plain SGD as the optimizer:

- **On the number of steps for convergence:** For the same problem, AdaSum achieves the same accuracy (100% in this case) in a lower number of steps than averaging. Depending on the complexity of the problem, this reduction can be up to 50% for the less complex square parameter optimization.


- **On scaling the learning rate for a higher number of workers**: For traditional averaging, when the number of workers is increased while the local batch size stays the same, the global batch size increases, causing a stronger smoothing effect on the gradients. To increase the speed of convergence, it is recommended that the learning rate be scaled up by the number of workers, as recommended in the paper `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_.

**From this example, we see that with AdaSum, the learning rate need not be scaled linearly with the number of workers; a better scaling factor is 2-2.5.**


- **On using LR decay**: With AdaSum, a form of regularization effect already takes place on the gradients. As training progresses, the magnitude of the gradients decreases, simulating the effect of decaying the learning rate. Although some decay might be necessary for training more complex models, this result should be kept in mind, as the same extent of decay might not be necessary.

MNIST
---------

**Higher accuracy with the same number of steps**

Here, we test whether the observations from the simple cubic optimization problem carry over to training MNIST with AdaSum. By scaling the best single-worker learning rate by 2.5 while using AdaSum with a higher number of nodes, we see that we consistently get better accuracy in the same number of steps compared to averaging.


|

Key Takeaways
===============

|

- AdaSum ensures correct convergence behavior even with large effective batch sizes.

- As the number of ranks scales up, the learning rate does not need to be scaled linearly when the CPU is used to do the AdaSum reduction. A good scaling factor is between 2 and 2.5 over the best learning rate for a single worker.

- If the HOROVOD_GPU_ALLREDUCE=NCCL flag is used to compile Horovod, the learning rate that should be used is equal to the best learning rate for a single worker (GPU) scaled by the number of GPUs on a node. On very large clusters, scaling this by another factor of 1.5-2.0x might give better results, but this is not guaranteed and should be tried only if scaling by just the local size is not sufficient for good convergence.

- PyTorch training in fp16 format is not yet supported. Integration of Apex into the new optimizer to enable full mixed-precision training with AdaSum in PyTorch is a work in progress.

- When the HOROVOD_GPU_ALLREDUCE=NCCL flag is used to compile Horovod and training is run on a single node, only averaging through the NCCL library is used to perform reductions, and the AdaSum algorithm does not take place in this configuration.

.. inclusion-marker-end-do-not-remove

docs/adasum_user_guide_include.rst

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.. include:: ./adasum_user_guide.rst
   :start-after: inclusion-marker-start-do-not-remove
   :end-before: inclusion-marker-end-do-not-remove

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -122,6 +122,8 @@ Guides
    lsf_include

    tensor-fusion_include
+
+   adasum_user_guide_include

    timeline_include

7 binary image files changed (previews not rendered).
