
Commit 8869284

Fixed autotuning with horovodrun by excluding unset parameters from the environment, and added docs for autotune (horovod#1356)
1 parent a639de5 commit 8869284

9 files changed: +198 -44 lines changed

README.rst

Lines changed: 9 additions & 0 deletions
@@ -306,6 +306,15 @@ Horovod has the ability to record the timeline of its activity, called Horovod Timeline.
See `here <docs/timeline.rst>`__ for full details and usage instructions.


Automated Performance Tuning
----------------------------
Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve
a good amount of trial and error. We provide a system to automate this performance optimization process called
**autotuning**, which you can enable with a single command line argument to ``horovodrun``.

See `here <docs/autotune.rst>`__ for full details and usage instructions.


Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.
Send us links to any user guides you want to publish on this site

docs/autotune.rst

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
.. inclusion-marker-start-do-not-remove


Autotune: Automated Performance Tuning
======================================

Horovod comes with several adjustable "knobs" that can affect runtime performance, including
``--fusion-threshold-mb`` and ``--cycle-time-ms`` (tensor fusion), ``--cache-capacity`` (response cache), and
the hierarchical collective algorithms ``--hierarchical-allreduce`` and ``--hierarchical-allgather``.

Determining the best combination of these values to maximize performance (minimize time to convergence) can be a
matter of trial and error, as many factors, including model complexity, network bandwidth, and GPU memory, can
affect training throughput (inputs per second).

Horovod provides a mechanism called **autotuning** to automate the process of selecting the best values for these
"knobs". The Horovod autotuning system uses
`Bayesian optimization <https://en.wikipedia.org/wiki/Bayesian_optimization>`_ to intelligently search through the
space of parameter combinations during training. This feature can be enabled by passing the ``--autotune`` flag to
``horovodrun``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune python train.py

When autotuning is enabled, Horovod will spend the first steps / epochs of training experimenting with different
parameter values and collecting metrics on performance (measured in bytes allreduced / allgathered per unit of time).
Once the experiment converges, or a set number of samples has been collected, the system will record the best
combination of parameters discovered and continue to use it for the remainder of training.

A log of all parameter combinations explored (and the best values selected) can be recorded by providing
the ``--autotune-log-file`` option to ``horovodrun``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --autotune-log-file /tmp/autotune_log.csv python train.py

By logging the best parameters to a file, you can set the discovered values on the command line the next time you
run, instead of re-running autotuning when training is paused and later resumed.
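
For example, a combination discovered by an earlier autotuning run (the specific values below are purely
illustrative) can be reapplied directly, skipping the search entirely:

.. code-block:: bash

    $ horovodrun -np 4 \
        --fusion-threshold-mb 32 --cycle-time-ms 3.5 --cache-capacity 2048 \
        --hierarchical-allreduce --no-hierarchical-allgather \
        python train.py
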
Note that some configurable parameters, like tensor compression, are not included in the autotuning process
because they can affect model convergence. The purpose of autotuning at this time is purely to improve scaling
efficiency without making any tradeoffs in model performance.


Constant Parameters
-------------------

Sometimes you may wish to hold certain values constant and tune only the unspecified parameters. This can be
accomplished by explicitly setting those values on the command line or in the config file provided
via ``--config-file``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --cache-capacity 1024 --no-hierarchical-allgather python train.py

In the above example, the parameters ``cache-capacity`` and ``hierarchical-allgather`` will not be adjusted by
autotuning.


Advanced Autotuning
-------------------

Enabling autotuning imposes a tradeoff: degraded performance during the early phases of training in exchange for
better performance later on. As such, autotuning is generally recommended when training is expected to take a long
time (many epochs on a very large dataset) and when scaling efficiency with the default settings has been found
lacking.

You can tune the autotuning system itself to change the number of warmup samples (discarded samples at the
beginning), steps per sample, and maximum samples:

.. code-block:: bash

    $ horovodrun -np 4 --autotune \
        --autotune-warmup-samples 5 --autotune-steps-per-sample 20 --autotune-bayes-opt-max-samples 40 \
        python train.py

Increasing these values generally improves the accuracy of the autotuning process at the cost of more time spent
in the (lower-performance) autotuning phase.

Finally, for those familiar with the underlying theory of Bayesian optimization and Gaussian processes, you can tune
the noise regularization term (alpha) to account for variance in your network or other system resources:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --autotune-gaussian-process-noise 0.75 python train.py

.. inclusion-marker-end-do-not-remove

docs/autotune_include.rst

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.. include:: ./autotune.rst
   :start-after: inclusion-marker-start-do-not-remove
   :end-before: inclusion-marker-end-do-not-remove

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -121,6 +121,8 @@ Guides

   timeline_include

   autotune_include

   troubleshooting_include

   contributors_include

docs/mpirun.rst

Lines changed: 6 additions & 0 deletions
@@ -112,6 +112,12 @@ Timeline:

    $ mpirun -x HOROVOD_TIMELINE=/path/to/timeline.json -x HOROVOD_TIMELINE_MARK_CYCLES=1 ... python train.py

Autotuning:

.. code-block:: bash

    $ mpirun -x HOROVOD_AUTOTUNE=1 -x HOROVOD_AUTOTUNE_LOG=/tmp/autotune_log.csv ... python train.py

Note that when using ``horovodrun``, any command line arguments will override values set in the environment.
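
For instance, when both are present (paths below are illustrative), the ``--autotune-log-file`` flag passed to
``horovodrun`` takes precedence over the ``HOROVOD_AUTOTUNE_LOG`` value inherited from the environment:

.. code-block:: bash

    $ HOROVOD_AUTOTUNE_LOG=/tmp/env_log.csv horovodrun -np 4 --autotune \
        --autotune-log-file /tmp/cli_log.csv python train.py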

Hangs due to non-routed network interfaces

docs/summary.rst

Lines changed: 10 additions & 0 deletions
@@ -323,6 +323,7 @@ to batch small *allreduce* operations, which results in improved performance.

See `here <tensor-fusion.rst>`__ for full details and tweaking instructions.


Horovod Timeline
----------------
Horovod has the ability to record the timeline of its activity, called Horovod Timeline.

@@ -334,6 +335,15 @@ Use Horovod timeline to analyze Horovod performance.
See `here <timeline.rst>`__ for full details and usage instructions.


Automated Performance Tuning
----------------------------
Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve
a good amount of trial and error. We provide a system to automate this performance optimization process called
**autotuning**, which you can enable with a single command line argument to ``horovodrun``.

See `here <autotune.rst>`__ for full details and usage instructions.


Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.

horovod/run/common/util/config_parser.py

Lines changed: 3 additions & 2 deletions
@@ -92,7 +92,7 @@ def set_args_from_config(args, config, override_args):

 def _validate_arg_nonnegative(args, arg_name):
     value = getattr(args, arg_name)
-    if value < 0:
+    if value is not None and value < 0:
         raise ValueError('{}={} must be >= 0'.format(arg_name, value))


@@ -104,7 +104,8 @@ def validate_config_args(args):
     _validate_arg_nonnegative(args, 'autotune_steps_per_sample')
     _validate_arg_nonnegative(args, 'autotune_bayes_opt_max_samples')

-    if args.autotune_gaussian_process_noise < 0 or args.autotune_gaussian_process_noise > 1:
+    noise = args.autotune_gaussian_process_noise
+    if noise is not None and (noise < 0 or noise > 1):
         raise ValueError('{}={} must be in [0, 1]'.format('autotune_gaussian_process_noise',
                                                           args.autotune_gaussian_process_noise))
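
The reason for the new ``None`` checks: with this commit, tunable parameters the user never sets default to ``None``
(see ``run.py`` below), and ``None`` means "leave this value to autotuning" rather than an invalid setting, so
validation must skip it. A minimal standalone sketch of the behavior (the ``Args`` class here is just a stand-in for
the parsed argparse namespace, not Horovod code):

def _validate_arg_nonnegative(args, arg_name):
    value = getattr(args, arg_name)
    if value is not None and value < 0:
        raise ValueError('{}={} must be >= 0'.format(arg_name, value))

class Args(object):
    autotune_warmup_samples = None    # never set on the command line: skipped, left for autotuning
    autotune_steps_per_sample = 20    # explicitly set: validated as usual

args = Args()
_validate_arg_nonnegative(args, 'autotune_warmup_samples')    # no error, value stays None
_validate_arg_nonnegative(args, 'autotune_steps_per_sample')  # passes (20 >= 0)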

horovod/run/run.py

Lines changed: 44 additions & 22 deletions
@@ -314,7 +314,7 @@ class StoreOverrideAction(argparse.Action):
         def __init__(self,
                      option_strings,
                      dest,
-                     default=False,
+                     default=None,
                      type=None,
                      required=False,
                      help=None):
@@ -334,28 +334,35 @@ def __call__(self, parser, args, values, option_string=None):
     return StoreOverrideAction


-def make_override_true_action(override_args):
-    class StoreOverrideTrueAction(argparse.Action):
+def make_override_bool_action(override_args, bool_value):
+    class StoreOverrideBoolAction(argparse.Action):
         def __init__(self,
                      option_strings,
                      dest,
-                     default=False,
                      required=False,
                      help=None):
-            super(StoreOverrideTrueAction, self).__init__(
+            super(StoreOverrideBoolAction, self).__init__(
                 option_strings=option_strings,
                 dest=dest,
-                const=True,
+                const=bool_value,
                 nargs=0,
-                default=default,
+                default=None,
                 required=required,
                 help=help)

         def __call__(self, parser, args, values, option_string=None):
             override_args.add(self.dest)
             setattr(args, self.dest, self.const)

-    return StoreOverrideTrueAction
+    return StoreOverrideBoolAction
+
+
+def make_override_true_action(override_args):
+    return make_override_bool_action(override_args, True)
+
+
+def make_override_false_action(override_args):
+    return make_override_bool_action(override_args, False)


 def parse_args():
@@ -407,28 +414,43 @@ def parse_args():
                             'this argument, and will be overridden by any arguments that come after it.')

     group_params = parser.add_argument_group('tuneable parameter arguments')
-    group_params.add_argument('--fusion-threshold-mb', action=make_override_action(override_args), type=int, default=64,
+    group_params.add_argument('--fusion-threshold-mb', action=make_override_action(override_args), type=int,
                               help='Fusion buffer threshold in MB. This is the maximum amount of '
                                    'tensor data that can be fused together into a single batch '
                                    'during allreduce / allgather. Setting 0 disables tensor fusion. '
-                                   '(default: %(default)s)')
-    group_params.add_argument('--cycle-time-ms', action=make_override_action(override_args), type=float, default=5,
+                                   '(default: 64)')
+    group_params.add_argument('--cycle-time-ms', action=make_override_action(override_args), type=float,
                               help='Cycle time in ms. This is the delay between each tensor fusion '
                                    'cycle. The larger the cycle time, the more batching, but the '
                                    'greater latency between each allreduce / allgather operations. '
-                                   '(default: %(default)s)')
-    group_params.add_argument('--cache-capacity', action=make_override_action(override_args), type=int, default=1024,
+                                   '(default: 5)')
+    group_params.add_argument('--cache-capacity', action=make_override_action(override_args), type=int,
                               help='Maximum number of tensor names that will be cached to reduce amount '
                                    'of coordination required between workers before performing allreduce / '
-                                   'allgather. (default: %(default)s)')
-    group_params.add_argument('--hierarchical-allreduce', action=make_override_true_action(override_args),
-                              help='Perform hierarchical allreduce between workers instead of ring allreduce. '
-                                   'Hierarchical allreduce performs a local allreduce / gather within a host, then '
-                                   'a parallel cross allreduce between equal local ranks across workers, and '
-                                   'finally a local gather.')
-    group_params.add_argument('--hierarchical-allgather', action=make_override_true_action(override_args),
-                              help='Perform hierarchical allgather between workers instead of ring allgather. See '
-                                   'hierarchical allreduce for algorithm details.')
+                                   'allgather. (default: 1024)')
+
+    group_hierarchical_allreduce = group_params.add_mutually_exclusive_group()
+    group_hierarchical_allreduce.add_argument('--hierarchical-allreduce',
+                                              action=make_override_true_action(override_args),
+                                              help='Perform hierarchical allreduce between workers instead of '
+                                                   'ring allreduce. Hierarchical allreduce performs a local '
+                                                   'allreduce / gather within a host, then a parallel cross allreduce '
+                                                   'between equal local ranks across workers, and finally a '
+                                                   'local gather.')
+    group_hierarchical_allreduce.add_argument('--no-hierarchical-allreduce', dest='hierarchical_allreduce',
+                                              action=make_override_false_action(override_args),
+                                              help='Explicitly disable hierarchical allreduce to prevent autotuning '
+                                                   'from adjusting it.')
+
+    group_hierarchical_allgather = group_params.add_mutually_exclusive_group()
+    group_hierarchical_allgather.add_argument('--hierarchical-allgather',
+                                              action=make_override_true_action(override_args),
+                                              help='Perform hierarchical allgather between workers instead of '
+                                                   'ring allgather. See hierarchical allreduce for algorithm details.')
+    group_hierarchical_allgather.add_argument('--no-hierarchical-allgather', dest='hierarchical_allgather',
+                                              action=make_override_false_action(override_args),
+                                              help='Explicitly disable hierarchical allgather to prevent autotuning '
+                                                   'from adjusting it.')

     group_autotune = parser.add_argument_group('autotune arguments')
     group_autotune.add_argument('--autotune', action=make_override_true_action(override_args),
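
Taken together, the boolean override actions give each flag three possible states: explicitly enabled (True),
explicitly disabled (False), or untouched (None, left for autotuning to decide). Only explicitly set flags are
recorded in ``override_args``, which is what lets ``horovodrun`` exclude unset parameters from the environment.
A minimal standalone sketch of that behavior (assumed driver code, not Horovod's full ``parse_args``):

import argparse

override_args = set()

def make_override_bool_action(override_args, bool_value):
    class StoreOverrideBoolAction(argparse.Action):
        def __init__(self, option_strings, dest, required=False, help=None):
            super(StoreOverrideBoolAction, self).__init__(
                option_strings=option_strings, dest=dest, const=bool_value,
                nargs=0, default=None, required=required, help=help)

        def __call__(self, parser, args, values, option_string=None):
            override_args.add(self.dest)          # remember that the user set this explicitly
            setattr(args, self.dest, self.const)  # store True or False

    return StoreOverrideBoolAction

parser = argparse.ArgumentParser()
parser.add_argument('--hierarchical-allreduce',
                    action=make_override_bool_action(override_args, True))
parser.add_argument('--no-hierarchical-allreduce', dest='hierarchical_allreduce',
                    action=make_override_bool_action(override_args, False))

print(parser.parse_args([]))                               # hierarchical_allreduce=None -> left to autotuning
print(parser.parse_args(['--no-hierarchical-allreduce']))  # hierarchical_allreduce=False, marked as an override
print(override_args)                                       # {'hierarchical_allreduce'}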
