Commit 356ff69

Added command line arguments for Horovod knob environment variables, config file, and new knobs for autotuning (horovod#1345)
1 parent 6efd5dd commit 356ff69

15 files changed

+617
-47
lines changed

docs/mpirun.rst

Lines changed: 20 additions & 0 deletions
@@ -94,6 +94,26 @@ example below:
 Other MPI RDMA implementations may or may not benefit from disabling multithreading, so please consult vendor
 documentation.
 
+Horovod Parameter Knobs
+-----------------------
+
+Many of the configurable parameters available as command line arguments to ``horovodrun`` can be used with ``mpirun``
+through the use of environment variables.
+
+Tensor Fusion:
+
+.. code-block:: bash
+
+    $ mpirun -x HOROVOD_FUSION_THRESHOLD=33554432 -x HOROVOD_CYCLE_TIME=3.5 ... python train.py
+
+Timeline:
+
+.. code-block:: bash
+
+    $ mpirun -x HOROVOD_TIMELINE=/path/to/timeline.json -x HOROVOD_TIMELINE_MARK_CYCLES=1 ... python train.py
+
+Note that when using ``horovodrun``, any command line arguments will override values set in the environment.
+
 Hangs due to non-routed network interfaces
 ------------------------------------------
 
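As a rough illustration of the ``mpirun -x NAME=value`` pattern documented above, the sketch below assembles such a command line from a dictionary of knobs. The helper name ``mpirun_env_args`` is hypothetical and not part of Horovod, and the other ``mpirun`` arguments that the documentation elides with ``...`` are simply left out here.

```python
import shlex


def mpirun_env_args(knobs):
    """Turn a {NAME: value} mapping into a flat list of `-x NAME=value` arguments."""
    args = []
    for name, value in knobs.items():
        args += ["-x", f"{name}={value}"]
    return args


# Knob names and values are taken from the documentation example above.
knobs = {
    "HOROVOD_FUSION_THRESHOLD": 32 * 1024 * 1024,  # bytes
    "HOROVOD_CYCLE_TIME": 3.5,                     # milliseconds
}

# Hypothetical, incomplete command: process count, hosts, etc. are omitted.
command = ["mpirun", *mpirun_env_args(knobs), "python", "train.py"]
print(shlex.join(command))
```

Keeping the knobs in one mapping makes it easier to see which settings are passed through the environment rather than as ``horovodrun`` flags.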

docs/tensor-fusion.rst

Lines changed: 6 additions & 9 deletions
@@ -16,25 +16,22 @@ one reduction operation. The algorithm of Tensor Fusion is as follows:
 5. Copy data from the fusion buffer into the output tensors.
 6. Repeat until there are no more tensors to reduce in this cycle.
 
-The fusion buffer size can be tweaked using the ``HOROVOD_FUSION_THRESHOLD`` environment variable:
+The fusion buffer size can be adjusted using the ``--fusion-threshold-mb`` command line argument to ``horovodrun``:
 
 .. code-block:: bash
 
-    $ HOROVOD_FUSION_THRESHOLD=33554432 horovodrun -np 4 python train.py
+    $ horovodrun -np 4 --fusion-threshold-mb 32 python train.py
 
-
-Setting the ``HOROVOD_FUSION_THRESHOLD`` environment variable to zero disables Tensor Fusion:
+Setting ``--fusion-threshold-mb`` to zero disables Tensor Fusion:
 
 .. code-block:: bash
 
-    $ HOROVOD_FUSION_THRESHOLD=0 horovodrun -np 4 python train.py
-
+    $ horovodrun -np 4 --fusion-threshold-mb 0 python train.py
 
-You can tweak time between cycles (defined in milliseconds) using the ``HOROVOD_CYCLE_TIME`` environment variable:
+You can tweak time between cycles (defined in milliseconds) using the ``--cycle-time-ms`` command line argument:
 
 .. code-block:: bash
 
-    $ HOROVOD_CYCLE_TIME=3.5 horovodrun -np 4 python train.py
-
+    $ horovodrun -np 4 --cycle-time-ms 3.5 python train.py
 
 .. inclusion-marker-end-do-not-remove
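The old example passes ``HOROVOD_FUSION_THRESHOLD=33554432`` (bytes) while the new one passes ``--fusion-threshold-mb 32``. The two agree if one megabyte is taken as 1024 * 1024 bytes. The conversion code is not part of the hunks shown here, so the helper below is only a sketch under that assumption, with a hypothetical name.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def fusion_threshold_bytes(threshold_mb):
    """Convert a threshold in megabytes to the byte count used by the environment variable."""
    return int(threshold_mb * BYTES_PER_MB)


print(fusion_threshold_bytes(32))
print(fusion_threshold_bytes(0))  # zero disables Tensor Fusion in both forms
```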

docs/timeline.rst

Lines changed: 4 additions & 7 deletions
@@ -9,12 +9,12 @@ Horovod has the ability to record the timeline of its activity, called Horovod T
     :alt: Horovod Timeline
 
 
-To record a Horovod Timeline, set the ``HOROVOD_TIMELINE`` environment variable to the location of the timeline
+To record a Horovod Timeline, set the ``--timeline-filename`` command line argument to the location of the timeline
 file to be created. This file is only recorded on rank 0, but it contains information about activity of all workers.
 
 .. code-block:: bash
 
-    $ HOROVOD_TIMELINE=/path/to/timeline.json horovodrun -np 4 python train.py
+    $ horovodrun -np 4 --timeline-filename /path/to/timeline.json python train.py
 
 
 You can then open the timeline file using the ``chrome://tracing`` facility of the `Chrome <https://www.google.com/chrome/browser/>`__ browser.
@@ -49,13 +49,10 @@ Horovod performs work in cycles. These cycles are used to aid `Tensor Fusion <h
     :alt: Cycle Markers
 
 
-Since this information makes timeline view very crowded, it is not enabled by default. To add cycle markers to the timeline, set the ``HOROVOD_TIMELINE_MARK_CYCLES`` environment variable to ``1``:
+Since this information makes timeline view very crowded, it is not enabled by default. To add cycle markers to the timeline, set the ``--timeline-mark-cycles`` flag:
 
 .. code-block:: bash
 
-    $ HOROVOD_TIMELINE=/path/to/timeline.json HOROVOD_TIMELINE_MARK_CYCLES=1 \
-        horovodrun -np 4 python train.py
-
-
+    $ horovodrun -np 4 --timeline-filename /path/to/timeline.json --timeline-mark-cycles python train.py
 
 .. inclusion-marker-end-do-not-remove

horovod/common/common.h

Lines changed: 4 additions & 0 deletions
@@ -63,6 +63,10 @@ namespace common {
 #define HOROVOD_TIMELINE_MARK_CYCLES "HOROVOD_TIMELINE_MARK_CYCLES"
 #define HOROVOD_AUTOTUNE "HOROVOD_AUTOTUNE"
 #define HOROVOD_AUTOTUNE_LOG "HOROVOD_AUTOTUNE_LOG"
+#define HOROVOD_AUTOTUNE_WARMUP_SAMPLES "HOROVOD_AUTOTUNE_WARMUP_SAMPLES"
+#define HOROVOD_AUTOTUNE_STEPS_PER_SAMPLE "HOROVOD_AUTOTUNE_STEPS_PER_SAMPLE"
+#define HOROVOD_AUTOTUNE_BAYES_OPT_MAX_SAMPLES "HOROVOD_AUTOTUNE_BAYES_OPT_MAX_SAMPLES"
+#define HOROVOD_AUTOTUNE_GAUSSIAN_PROCESS_NOISE "HOROVOD_AUTOTUNE_GAUSSIAN_PROCESS_NOISE"
 #define HOROVOD_FUSION_THRESHOLD "HOROVOD_FUSION_THRESHOLD"
 #define HOROVOD_CYCLE_TIME "HOROVOD_CYCLE_TIME"
 #define HOROVOD_STALL_CHECK_DISABLE "HOROVOD_STALL_CHECK_DISABLE"

horovod/common/parameter_manager.cc

Lines changed: 23 additions & 13 deletions
@@ -20,14 +20,15 @@
 #include <limits>
 
 #include "logging.h"
+#include "utils/env_parser.h"
 
 namespace horovod {
 namespace common {
 
-#define WARMUPS 3
-#define CYCLES_PER_SAMPLE 10
-#define BAYES_OPT_MAX_SAMPLES 20
-#define GAUSSIAN_PROCESS_NOISE 0.8
+#define DEFAULT_WARMUPS 3
+#define DEFAULT_STEPS_PER_SAMPLE 10
+#define DEFAULT_BAYES_OPT_MAX_SAMPLES 20
+#define DEFAULT_GAUSSIAN_PROCESS_NOISE 0.8
 
 Eigen::VectorXd CreateVector(double x1, double x2) {
   Eigen::VectorXd v(2);
@@ -38,23 +39,28 @@ Eigen::VectorXd CreateVector(double x1, double x2) {
 
 // ParameterManager
 ParameterManager::ParameterManager() :
+    warmups_(GetIntEnvOrDefault(HOROVOD_AUTOTUNE_WARMUP_SAMPLES, DEFAULT_WARMUPS)),
+    steps_per_sample_(GetIntEnvOrDefault(HOROVOD_AUTOTUNE_STEPS_PER_SAMPLE, DEFAULT_STEPS_PER_SAMPLE)),
     hierarchical_allreduce_(CategoricalParameter<bool>(std::vector<bool>{false, true})),
     hierarchical_allgather_(CategoricalParameter<bool>(std::vector<bool>{false, true})),
     cache_enabled_(CategoricalParameter<bool>(std::vector<bool>{false, true})),
     joint_params_(BayesianParameter(
       std::vector<BayesianVariableConfig>{
         { BayesianVariable::fusion_buffer_threshold_mb, std::pair<double, double>(0, 64) },
         { BayesianVariable::cycle_time_ms, std::pair<double, double>(1, 100) }
-      }, std::vector<Eigen::VectorXd>{
+      },
+      std::vector<Eigen::VectorXd>{
         CreateVector(4, 5),
         CreateVector(32, 50),
         CreateVector(16, 25),
         CreateVector(8, 10)
-      })),
+      },
+      GetIntEnvOrDefault(HOROVOD_AUTOTUNE_BAYES_OPT_MAX_SAMPLES, DEFAULT_BAYES_OPT_MAX_SAMPLES),
+      GetDoubleEnvOrDefault(HOROVOD_AUTOTUNE_GAUSSIAN_PROCESS_NOISE, DEFAULT_GAUSSIAN_PROCESS_NOISE))),
     parameter_chain_(std::vector<ITunableParameter*>{&joint_params_, &hierarchical_allreduce_, &hierarchical_allgather_,
                                                      &cache_enabled_}),
     active_(false),
-    warmup_remaining_(WARMUPS),
+    warmup_remaining_(warmups_),
     sample_(0),
     rank_(-1),
     root_rank_(0),
@@ -80,7 +86,7 @@ void ParameterManager::Initialize(int32_t rank, int32_t root_rank,
 
 void ParameterManager::SetAutoTuning(bool active) {
   if (active != active_) {
-    warmup_remaining_ = WARMUPS;
+    warmup_remaining_ = warmups_;
   }
   active_ = active;
 };
@@ -140,8 +146,8 @@ bool ParameterManager::Update(const std::vector<std::string>& tensor_names,
   }
 
   for (const std::string& tensor_name : tensor_names) {
-    int32_t cycle = tensor_counts_[tensor_name]++;
-    if (cycle >= (sample_ + 1) * CYCLES_PER_SAMPLE) {
+    int32_t step = tensor_counts_[tensor_name]++;
+    if (step >= (sample_ + 1) * steps_per_sample_) {
       auto now = std::chrono::steady_clock::now();
       double duration = std::chrono::duration_cast<std::chrono::microseconds>(now - last_sample_start_).count();
       scores_[sample_] = total_bytes_ / duration;
@@ -391,10 +397,14 @@ void ParameterManager::CategoricalParameter<T>::ResetState() {
 // BayesianParameter
 ParameterManager::BayesianParameter::BayesianParameter(
     std::vector<BayesianVariableConfig> variables,
-    std::vector<Eigen::VectorXd> test_points) :
+    std::vector<Eigen::VectorXd> test_points,
+    int max_samples,
+    double gaussian_process_noise) :
     TunableParameter<Eigen::VectorXd>(test_points[0]),
     variables_(variables),
     test_points_(test_points),
+    max_samples_(max_samples),
+    gaussian_process_noise_(gaussian_process_noise),
     iteration_(0) {
   ResetBayes();
   Reinitialize(FilterTestPoint(0));
@@ -453,7 +463,7 @@ void ParameterManager::BayesianParameter::OnTune(double score, Eigen::VectorXd&
 }
 
 bool ParameterManager::BayesianParameter::IsDoneTuning() const {
-  return iteration_ > BAYES_OPT_MAX_SAMPLES;
+  return iteration_ > max_samples_;
 }
 
 void ParameterManager::BayesianParameter::ResetState() {
@@ -474,7 +484,7 @@ void ParameterManager::BayesianParameter::ResetBayes() {
     }
   }
 
-  bayes_.reset(new BayesianOptimization(bounds, GAUSSIAN_PROCESS_NOISE));
+  bayes_.reset(new BayesianOptimization(bounds, gaussian_process_noise_));
 }
 
 Eigen::VectorXd ParameterManager::BayesianParameter::FilterTestPoint(int i) {
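The ``Update`` hunk above closes a sample once a tensor's post-incremented counter reaches ``(sample_ + 1) * steps_per_sample_``. The Python sketch below simulates that threshold for a single tensor to show where the sample boundaries would fall. It assumes that the sample index advances by one each time a score is recorded, which is not visible in the hunk, and the function name is hypothetical.

```python
def scored_steps(steps_per_sample, total_steps):
    """Return the step indices at which a sample score would be recorded."""
    sample = 0
    count = 0
    scored_at = []
    for _ in range(total_steps):
        step = count  # value before the post-increment, as in `counts[name]++`
        count += 1
        if step >= (sample + 1) * steps_per_sample:
            scored_at.append(step)
            sample += 1  # assumption: the next sample starts after scoring
    return scored_at


# With the default of 10 steps per sample and 35 occurrences of one tensor.
print(scored_steps(10, 35))
```

Under these assumptions, a larger ``HOROVOD_AUTOTUNE_STEPS_PER_SAMPLE`` spaces the boundaries further apart, so each score averages over more steps.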

horovod/common/parameter_manager.h

Lines changed: 8 additions & 2 deletions
@@ -185,7 +185,8 @@ class ParameterManager {
   // A set of numerical parameters optimized jointly using Bayesian Optimization.
   class BayesianParameter : public TunableParameter<Eigen::VectorXd> {
   public:
-    BayesianParameter(std::vector<BayesianVariableConfig> variables, std::vector<Eigen::VectorXd> test_points);
+    BayesianParameter(std::vector<BayesianVariableConfig> variables, std::vector<Eigen::VectorXd> test_points,
+                      int max_samples, double gaussian_process_noise);
 
     void SetValue(BayesianVariable variable, double value, bool fixed);
     double Value(BayesianVariable variable) const;
@@ -201,6 +202,9 @@ class ParameterManager {
 
     std::vector<BayesianVariableConfig> variables_;
     std::vector<Eigen::VectorXd> test_points_;
+    int max_samples_;
+    double gaussian_process_noise_;
+
     uint32_t iteration_;
 
     struct EnumClassHash {
@@ -215,6 +219,9 @@ class ParameterManager {
     std::unordered_map<BayesianVariable, int32_t, EnumClassHash> index_;
   };
 
+  int warmups_;
+  int steps_per_sample_;
+
   CategoricalParameter<bool> hierarchical_allreduce_;
   CategoricalParameter<bool> hierarchical_allgather_;
   CategoricalParameter<bool> cache_enabled_;
@@ -236,7 +243,6 @@ class ParameterManager {
   int32_t root_rank_;
   std::ofstream file_;
   bool writing_;
-
 };
 
 } // namespace common

horovod/common/utils/env_parser.cc

Lines changed: 5 additions & 0 deletions
@@ -154,5 +154,10 @@ int GetIntEnvOrDefault(const char* env_variable, int default_value) {
   return env_value != nullptr ? std::strtol(env_value, nullptr, 10) : default_value;
 }
 
+double GetDoubleEnvOrDefault(const char* env_variable, double default_value) {
+  auto env_value = std::getenv(env_variable);
+  return env_value != nullptr ? std::strtod(env_value, nullptr) : default_value;
+}
+
 } // namespace common
 }
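For readers more at home in Python, a rough analogue of the new ``GetDoubleEnvOrDefault`` helper is sketched below. It is not an exact port: ``float()`` raises ``ValueError`` on unparsable input, whereas ``std::strtod`` called this way returns ``0.0`` or parses a leading numeric prefix without reporting an error. The function name is a hypothetical Python spelling of the C++ one.

```python
import os


def get_double_env_or_default(name, default):
    """Return the environment variable parsed as a float, or the default if it is unset."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default


# The variable name and the 0.8 default come from the commit above.
noise = get_double_env_or_default("HOROVOD_AUTOTUNE_GAUSSIAN_PROCESS_NOISE", 0.8)
print(noise)
```

In both versions the default applies only when the variable is unset. An empty string counts as set, so in C++ it yields ``0.0`` and in this sketch it raises ``ValueError``.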

horovod/common/utils/env_parser.h

Lines changed: 2 additions & 0 deletions
@@ -41,6 +41,8 @@ void SetIntFromEnv(const char* env, int& val);
 
 int GetIntEnvOrDefault(const char* env_variable, int default_value);
 
+double GetDoubleEnvOrDefault(const char* env_variable, double default_value);
+
 } // namespace common
 } // namespace horovod
 
