
Commit 8869284

Fixed autotuning with horovodrun by excluding unset parameters from the environment, and added docs for autotune (horovod#1356)
1 parent a639de5 commit 8869284

9 files changed: +198 -44 lines changed

README.rst

Lines changed: 9 additions & 0 deletions
@@ -306,6 +306,15 @@ Horovod has the ability to record the timeline of its activity, called Horovod Timeline.
See `here <docs/timeline.rst>`__ for full details and usage instructions.


Automated Performance Tuning
----------------------------
Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve
a good amount of trial and error. We provide a system to automate this performance optimization process called
**autotuning**, which you can enable with a single command line argument to ``horovodrun``.

See `here <docs/autotune.rst>`__ for full details and usage instructions.


Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.
Send us links to any user guides you want to publish on this site

docs/autotune.rst

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
.. inclusion-marker-start-do-not-remove


Autotune: Automated Performance Tuning
======================================

Horovod comes with several adjustable "knobs" that can affect runtime performance, including
``--fusion-threshold-mb`` and ``--cycle-time-ms`` (tensor fusion), ``--cache-capacity`` (response cache), and
the hierarchical collective algorithms ``--hierarchical-allreduce`` and ``--hierarchical-allgather``.

Determining the best combination of these values to maximize performance (minimize time to convergence) can be a
matter of trial and error, as many factors, including model complexity, network bandwidth, and GPU memory, can
affect training throughput (inputs per second).

Horovod provides a mechanism called **autotuning** to automate the process of selecting the best values for these
"knobs". The Horovod autotuning system uses
`Bayesian optimization <https://en.wikipedia.org/wiki/Bayesian_optimization>`_ to intelligently search through the
space of parameter combinations during training. This feature can be enabled by passing the ``--autotune`` flag to
``horovodrun``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune python train.py

When autotuning is enabled, Horovod will spend the first steps / epochs of training experimenting with different
parameter values and collecting metrics on performance (measured in bytes allreduced / allgathered per unit of time).
Once the experiment converges, or a set number of samples has been collected, the system will record the best
combination of parameters discovered and continue to use it for the remainder of training.

A log of all parameter combinations explored (and the best values selected) can be recorded by providing
the ``--autotune-log-file`` option to ``horovodrun``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --autotune-log-file /tmp/autotune_log.csv python train.py

By logging the best parameters to a file, you can set the discovered values on the command line the next time you
run, instead of re-running autotuning when training is paused and later resumed.
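
For example, a combination discovered by an earlier autotuning run (the specific values below are purely
illustrative) can be reapplied directly, skipping the search entirely:

.. code-block:: bash

    $ horovodrun -np 4 \
        --fusion-threshold-mb 32 --cycle-time-ms 3.5 --cache-capacity 2048 \
        --hierarchical-allreduce --no-hierarchical-allgather \
        python train.py
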
Note that some configurable parameters, like tensor compression, are not included in the autotuning process
because they can affect model convergence. The purpose of autotuning at this time is purely to improve scaling
efficiency without making any tradeoffs in model performance.


Constant Parameters
-------------------

Sometimes you may wish to hold certain values constant and tune only the unspecified parameters. This can be
accomplished by explicitly setting those values on the command line or in the config file provided
via ``--config-file``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --cache-capacity 1024 --no-hierarchical-allgather python train.py

In the above example, the parameters ``cache-capacity`` and ``hierarchical-allgather`` will not be adjusted by
autotuning.


Advanced Autotuning
-------------------

Enabling autotuning imposes a tradeoff: degraded performance during the early phases of training in exchange for
better performance later on. As such, autotuning is generally recommended when training is expected to take a long
time (many epochs on a very large dataset) and when scaling efficiency with the default settings has been found
lacking.

You can tune the autotuning system itself to change the number of warmup samples (discarded samples at the
beginning), steps per sample, and maximum samples:

.. code-block:: bash

    $ horovodrun -np 4 --autotune \
        --autotune-warmup-samples 5 --autotune-steps-per-sample 20 --autotune-bayes-opt-max-samples 40 \
        python train.py

Increasing these values generally improves the accuracy of the autotuning process at the cost of more time spent
in the (lower-performance) autotuning phase.

Finally, for those familiar with the underlying theory of Bayesian optimization and Gaussian processes, you can tune
the noise regularization term (alpha) to account for variance in your network or other system resources:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --autotune-gaussian-process-noise 0.75 python train.py

.. inclusion-marker-end-do-not-remove

docs/autotune_include.rst

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.. include:: ./autotune.rst
   :start-after: inclusion-marker-start-do-not-remove
   :end-before: inclusion-marker-end-do-not-remove

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -121,6 +121,8 @@ Guides

   timeline_include

   autotune_include

   troubleshooting_include

   contributors_include

docs/mpirun.rst

Lines changed: 6 additions & 0 deletions
@@ -112,6 +112,12 @@ Timeline:

    $ mpirun -x HOROVOD_TIMELINE=/path/to/timeline.json -x HOROVOD_TIMELINE_MARK_CYCLES=1 ... python train.py

Autotuning:

.. code-block:: bash

    $ mpirun -x HOROVOD_AUTOTUNE=1 -x HOROVOD_AUTOTUNE_LOG=/tmp/autotune_log.csv ... python train.py

Note that when using ``horovodrun``, any command line arguments will override values set in the environment.
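
For instance, when both are present (paths below are illustrative), the ``--autotune-log-file`` flag passed to
``horovodrun`` takes precedence over the ``HOROVOD_AUTOTUNE_LOG`` value inherited from the environment:

.. code-block:: bash

    $ HOROVOD_AUTOTUNE_LOG=/tmp/env_log.csv horovodrun -np 4 --autotune \
        --autotune-log-file /tmp/cli_log.csv python train.py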

Hangs due to non-routed network interfaces

docs/summary.rst

Lines changed: 10 additions & 0 deletions
@@ -323,6 +323,7 @@ to batch small *allreduce* operations, which results in improved performance.

See `here <tensor-fusion.rst>`__ for full details and tweaking instructions.


Horovod Timeline
----------------
Horovod has the ability to record the timeline of its activity, called Horovod Timeline.

@@ -334,6 +335,15 @@ Use Horovod timeline to analyze Horovod performance.
See `here <timeline.rst>`__ for full details and usage instructions.


Automated Performance Tuning
----------------------------
Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve
a good amount of trial and error. We provide a system to automate this performance optimization process called
**autotuning**, which you can enable with a single command line argument to ``horovodrun``.

See `here <autotune.rst>`__ for full details and usage instructions.


Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.

horovod/run/common/util/config_parser.py

Lines changed: 3 additions & 2 deletions
@@ -92,7 +92,7 @@ def set_args_from_config(args, config, override_args):

 def _validate_arg_nonnegative(args, arg_name):
     value = getattr(args, arg_name)
-    if value < 0:
+    if value is not None and value < 0:
         raise ValueError('{}={} must be >= 0'.format(arg_name, value))


@@ -104,7 +104,8 @@ def validate_config_args(args):
     _validate_arg_nonnegative(args, 'autotune_steps_per_sample')
     _validate_arg_nonnegative(args, 'autotune_bayes_opt_max_samples')

-    if args.autotune_gaussian_process_noise < 0 or args.autotune_gaussian_process_noise > 1:
+    noise = args.autotune_gaussian_process_noise
+    if noise is not None and (noise < 0 or noise > 1):
         raise ValueError('{}={} must be in [0, 1]'.format('autotune_gaussian_process_noise',
                                                           args.autotune_gaussian_process_noise))
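
The reason for the new ``None`` checks: with this commit, tunable parameters the user never sets default to ``None``
(see ``run.py`` below), and ``None`` means "leave this value to autotuning" rather than an invalid setting, so
validation must skip it. A minimal standalone sketch of the behavior (the ``Args`` class here is just a stand-in for
the parsed argparse namespace, not Horovod code):

def _validate_arg_nonnegative(args, arg_name):
    value = getattr(args, arg_name)
    if value is not None and value < 0:
        raise ValueError('{}={} must be >= 0'.format(arg_name, value))

class Args(object):
    autotune_warmup_samples = None    # never set on the command line: skipped, left for autotuning
    autotune_steps_per_sample = 20    # explicitly set: validated as usual

args = Args()
_validate_arg_nonnegative(args, 'autotune_warmup_samples')    # no error, value stays None
_validate_arg_nonnegative(args, 'autotune_steps_per_sample')  # passes (20 >= 0)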

horovod/run/run.py

Lines changed: 44 additions & 22 deletions
@@ -314,7 +314,7 @@ class StoreOverrideAction(argparse.Action):
         def __init__(self,
                      option_strings,
                      dest,
-                     default=False,
+                     default=None,
                      type=None,
                      required=False,
                      help=None):
@@ -334,28 +334,35 @@ def __call__(self, parser, args, values, option_string=None):
     return StoreOverrideAction


-def make_override_true_action(override_args):
-    class StoreOverrideTrueAction(argparse.Action):
+def make_override_bool_action(override_args, bool_value):
+    class StoreOverrideBoolAction(argparse.Action):
         def __init__(self,
                      option_strings,
                      dest,
-                     default=False,
                      required=False,
                      help=None):
-            super(StoreOverrideTrueAction, self).__init__(
+            super(StoreOverrideBoolAction, self).__init__(
                 option_strings=option_strings,
                 dest=dest,
-                const=True,
+                const=bool_value,
                 nargs=0,
-                default=default,
+                default=None,
                 required=required,
                 help=help)

         def __call__(self, parser, args, values, option_string=None):
             override_args.add(self.dest)
             setattr(args, self.dest, self.const)

-    return StoreOverrideTrueAction
+    return StoreOverrideBoolAction
+
+
+def make_override_true_action(override_args):
+    return make_override_bool_action(override_args, True)
+
+
+def make_override_false_action(override_args):
+    return make_override_bool_action(override_args, False)


 def parse_args():
@@ -407,28 +414,43 @@ def parse_args():
                             'this argument, and will be overridden by any arguments that come after it.')

     group_params = parser.add_argument_group('tuneable parameter arguments')
-    group_params.add_argument('--fusion-threshold-mb', action=make_override_action(override_args), type=int, default=64,
+    group_params.add_argument('--fusion-threshold-mb', action=make_override_action(override_args), type=int,
                               help='Fusion buffer threshold in MB. This is the maximum amount of '
                                    'tensor data that can be fused together into a single batch '
                                    'during allreduce / allgather. Setting 0 disables tensor fusion. '
-                                   '(default: %(default)s)')
-    group_params.add_argument('--cycle-time-ms', action=make_override_action(override_args), type=float, default=5,
+                                   '(default: 64)')
+    group_params.add_argument('--cycle-time-ms', action=make_override_action(override_args), type=float,
                               help='Cycle time in ms. This is the delay between each tensor fusion '
                                    'cycle. The larger the cycle time, the more batching, but the '
                                    'greater latency between each allreduce / allgather operations. '
-                                   '(default: %(default)s)')
-    group_params.add_argument('--cache-capacity', action=make_override_action(override_args), type=int, default=1024,
+                                   '(default: 5)')
+    group_params.add_argument('--cache-capacity', action=make_override_action(override_args), type=int,
                               help='Maximum number of tensor names that will be cached to reduce amount '
                                    'of coordination required between workers before performing allreduce / '
-                                   'allgather. (default: %(default)s)')
-    group_params.add_argument('--hierarchical-allreduce', action=make_override_true_action(override_args),
-                              help='Perform hierarchical allreduce between workers instead of ring allreduce. '
-                                   'Hierarchical allreduce performs a local allreduce / gather within a host, then '
-                                   'a parallel cross allreduce between equal local ranks across workers, and '
-                                   'finally a local gather.')
-    group_params.add_argument('--hierarchical-allgather', action=make_override_true_action(override_args),
-                              help='Perform hierarchical allgather between workers instead of ring allgather. See '
-                                   'hierarchical allreduce for algorithm details.')
+                                   'allgather. (default: 1024)')
+
+    group_hierarchical_allreduce = group_params.add_mutually_exclusive_group()
+    group_hierarchical_allreduce.add_argument('--hierarchical-allreduce',
+                                              action=make_override_true_action(override_args),
+                                              help='Perform hierarchical allreduce between workers instead of '
+                                                   'ring allreduce. Hierarchical allreduce performs a local '
+                                                   'allreduce / gather within a host, then a parallel cross allreduce '
+                                                   'between equal local ranks across workers, and finally a '
+                                                   'local gather.')
+    group_hierarchical_allreduce.add_argument('--no-hierarchical-allreduce', dest='hierarchical_allreduce',
+                                              action=make_override_false_action(override_args),
+                                              help='Explicitly disable hierarchical allreduce to prevent autotuning '
+                                                   'from adjusting it.')
+
+    group_hierarchical_allgather = group_params.add_mutually_exclusive_group()
+    group_hierarchical_allgather.add_argument('--hierarchical-allgather',
+                                              action=make_override_true_action(override_args),
+                                              help='Perform hierarchical allgather between workers instead of '
+                                                   'ring allgather. See hierarchical allreduce for algorithm details.')
+    group_hierarchical_allgather.add_argument('--no-hierarchical-allgather', dest='hierarchical_allgather',
+                                              action=make_override_false_action(override_args),
+                                              help='Explicitly disable hierarchical allgather to prevent autotuning '
+                                                   'from adjusting it.')

     group_autotune = parser.add_argument_group('autotune arguments')
     group_autotune.add_argument('--autotune', action=make_override_true_action(override_args),
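
Taken together, the boolean override actions give each flag three possible states: explicitly enabled (True),
explicitly disabled (False), or untouched (None, left for autotuning to decide). Only explicitly set flags are
recorded in ``override_args``, which is what lets ``horovodrun`` exclude unset parameters from the environment.
A minimal standalone sketch of that behavior (assumed driver code, not Horovod's full ``parse_args``):

import argparse

override_args = set()

def make_override_bool_action(override_args, bool_value):
    class StoreOverrideBoolAction(argparse.Action):
        def __init__(self, option_strings, dest, required=False, help=None):
            super(StoreOverrideBoolAction, self).__init__(
                option_strings=option_strings, dest=dest, const=bool_value,
                nargs=0, default=None, required=required, help=help)

        def __call__(self, parser, args, values, option_string=None):
            override_args.add(self.dest)          # remember that the user set this explicitly
            setattr(args, self.dest, self.const)  # store True or False

    return StoreOverrideBoolAction

parser = argparse.ArgumentParser()
parser.add_argument('--hierarchical-allreduce',
                    action=make_override_bool_action(override_args, True))
parser.add_argument('--no-hierarchical-allreduce', dest='hierarchical_allreduce',
                    action=make_override_bool_action(override_args, False))

print(parser.parse_args([]))                               # hierarchical_allreduce=None -> left to autotuning
print(parser.parse_args(['--no-hierarchical-allreduce']))  # hierarchical_allreduce=False, marked as an override
print(override_args)                                       # {'hierarchical_allreduce'}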
