
Commit 58201a1

alexeib authored and facebook-github-bot committed
migrate roberta glue finetuning to hydra (#2035)
Summary: this allows roberta finetuning on different tasks using yaml config files + hydra entry point

Pull Request resolved: fairinternal/fairseq-py#2035
Reviewed By: Mortimerp9
Differential Revision: D29601732
Pulled By: alexeib
fbshipit-source-id: 774ef974b4b40ad0ced76874c62047d0c46520e7
1 parent 7b710ac commit 58201a1
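
In practice, the change replaces the long `fairseq-train` invocation in the GLUE README with a Hydra launch that selects a per-task YAML config. A minimal sketch, taken from the updated README in this diff (the model path is a placeholder):

```bash
ROBERTA_PATH=/path/to/roberta/model.pt

CUDA_VISIBLE_DEVICES=0 fairseq-hydra-train --config-dir examples/roberta/config/finetuning \
    --config-name rte task.data=RTE-bin checkpoint.restore_file=$ROBERTA_PATH
```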

File tree

14 files changed: +617, -179 lines changed


examples/roberta/README.glue.md

Lines changed: 5 additions & 40 deletions
@@ -17,54 +17,19 @@ Use `ALL` for preprocessing all the glue tasks.
 ### 3) Fine-tuning on GLUE task:
 Example fine-tuning cmd for `RTE` task
 ```bash
-TOTAL_NUM_UPDATES=2036  # 10 epochs through RTE for bsz 16
-WARMUP_UPDATES=122      # 6 percent of the number of updates
-LR=2e-05                # Peak LR for polynomial LR scheduler.
-NUM_CLASSES=2
-MAX_SENTENCES=16        # Batch size.
 ROBERTA_PATH=/path/to/roberta/model.pt
 
-CUDA_VISIBLE_DEVICES=0 fairseq-train RTE-bin/ \
-    --restore-file $ROBERTA_PATH \
-    --max-positions 512 \
-    --batch-size $MAX_SENTENCES \
-    --max-tokens 4400 \
-    --task sentence_prediction \
-    --reset-optimizer --reset-dataloader --reset-meters \
-    --required-batch-size-multiple 1 \
-    --init-token 0 --separator-token 2 \
-    --arch roberta_large \
-    --criterion sentence_prediction \
-    --num-classes $NUM_CLASSES \
-    --dropout 0.1 --attention-dropout 0.1 \
-    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
-    --clip-norm 0.0 \
-    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
-    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
-    --max-epoch 10 \
-    --find-unused-parameters \
-    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
+CUDA_VISIBLE_DEVICES=0 fairseq-hydra-train --config-dir examples/roberta/config/finetuning --config-name rte \
+task.data=RTE-bin checkpoint.restore_file=$ROBERTA_PATH
 ```
 
-For each of the GLUE task, you will need to use following cmd-line arguments:
-
-Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
----|---|---|---|---|---|---|---|---
-`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
-`--lr` | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5
-`--batch-size` | 32 | 32 | 32 | 16 | 32 | 16 | 16 | 16
-`--total-num-update` | 123873 | 33112 | 113272 | 2036 | 20935 | 2296 | 5336 | 3598
-`--warmup-updates` | 7432 | 1986 | 28318 | 122 | 1256 | 137 | 320 | 214
-
-For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
+There are additional config files for each of the GLUE tasks in the `examples/roberta/config/finetuning` directory.
 
 **Note:**
 
-a) `--total-num-updates` is used by `--polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--batch-size=16/32` depending on the task.
-
-b) Above cmd-args and hyperparams are tested on one Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory resources available to you, you can use increase `--update-freq` and reduce `--batch-size`.
+a) The above hyperparams are tested on one Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory resources available to you, you can increase `--update-freq` and reduce `--batch-size`.
 
-c) All the settings in above table are suggested settings based on our hyperparam search within a fixed search space (for careful comparison across models). You might be able to find better metrics with wider hyperparam search.
+b) All the settings in the config files are suggested settings based on our hyperparam search within a fixed search space (for careful comparison across models). You might be able to find better metrics with a wider hyperparam search.
 
 ### Inference on GLUE task
 After training the model as mentioned in previous step, you can perform inference with checkpoints in `checkpoints/` directory using following python code snippet:
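
Since the per-task settings from the removed table now live in Hydra/OmegaConf config groups, individual hyperparameters can still be adjusted at launch time with dotted-path overrides. A hedged sketch, reusing `$ROBERTA_PATH` and the RTE values from the old table (2e-05 peak LR, 122 warmup updates, batch size 16); the override keys simply mirror the config groups in the new YAML files:

```bash
# Override individual hyperparameters with Hydra dotted paths (RTE values from the old table).
# The list-valued lr override is quoted so the shell does not glob the brackets.
CUDA_VISIBLE_DEVICES=0 fairseq-hydra-train --config-dir examples/roberta/config/finetuning \
    --config-name rte \
    task.data=RTE-bin \
    checkpoint.restore_file=$ROBERTA_PATH \
    'optimization.lr=[2e-05]' \
    lr_scheduler.warmup_updates=122 \
    dataset.batch_size=16
```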
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# @package _group_

common:
  fp16: true
  fp16_init_scale: 4
  threshold_loss_scale: 1
  fp16_scale_window: 128

task:
  _name: sentence_prediction
  data: ???
  init_token: 0
  separator_token: 2
  num_classes: 2
  max_positions: 512

checkpoint:
  restore_file: ???
  reset_optimizer: true
  reset_dataloader: true
  reset_meters: true
  best_checkpoint_metric: accuracy
  maximize_best_checkpoint_metric: true

distributed_training:
  find_unused_parameters: true
  distributed_world_size: 1

criterion:
  _name: sentence_prediction

dataset:
  batch_size: 16
  required_batch_size_multiple: 1
  max_tokens: 4400

optimizer:
  _name: adam
  weight_decay: 0.1
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 320

optimization:
  clip_norm: 0.0
  lr: [1e-05]
  max_update: 5336
  max_epoch: 10

model:
  _name: roberta_large
  dropout: 0.1
  attention_dropout: 0.1
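
In this config (whose values match the CoLA column of the removed README table), `???` is OmegaConf's marker for a mandatory value with no default, which is why `task.data` and `checkpoint.restore_file` must be supplied on the command line as in the README example. Since `fairseq-hydra-train` is a standard Hydra entry point, the fully composed config can presumably also be printed without launching training via Hydra's stock `--cfg` flag; a hedged sketch (the flag is generic Hydra behaviour, not something added by this commit):

```bash
# Print the composed job config and exit (no training); the two mandatory `???`
# fields are filled in so the printed config is complete.
fairseq-hydra-train --config-dir examples/roberta/config/finetuning --config-name rte \
    task.data=RTE-bin checkpoint.restore_file=/path/to/roberta/model.pt --cfg job
```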
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# @package _group_

common:
  fp16: true
  fp16_init_scale: 4
  threshold_loss_scale: 1
  fp16_scale_window: 128

task:
  _name: sentence_prediction
  data: ???
  init_token: 0
  separator_token: 2
  num_classes: 3
  max_positions: 512

checkpoint:
  restore_file: ???
  reset_optimizer: true
  reset_dataloader: true
  reset_meters: true
  best_checkpoint_metric: accuracy
  maximize_best_checkpoint_metric: true

distributed_training:
  find_unused_parameters: true
  distributed_world_size: 1

criterion:
  _name: sentence_prediction

dataset:
  batch_size: 32
  required_batch_size_multiple: 1
  max_tokens: 4400

optimizer:
  _name: adam
  weight_decay: 0.1
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 7432

optimization:
  clip_norm: 0.0
  lr: [1e-05]
  max_update: 123873
  max_epoch: 10

model:
  _name: roberta_large
  dropout: 0.1
  attention_dropout: 0.1
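
This variant matches the MNLI column of the table removed from the README (3 classes, batch size 32, 123873 total updates, 7432 warmup updates). Assuming it is installed under the finetuning config directory with an `mnli` config name (the filenames are not shown in this excerpt), a hedged launch sketch:

```bash
# Hypothetical config name `mnli`; the data directory follows the GLUE preprocessing
# naming used elsewhere in the README (RTE-bin -> MNLI-bin).
CUDA_VISIBLE_DEVICES=0 fairseq-hydra-train --config-dir examples/roberta/config/finetuning \
    --config-name mnli task.data=MNLI-bin checkpoint.restore_file=$ROBERTA_PATH
```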
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# @package _group_

common:
  fp16: true
  fp16_init_scale: 4
  threshold_loss_scale: 1
  fp16_scale_window: 128

task:
  _name: sentence_prediction
  data: ???
  init_token: 0
  separator_token: 2
  num_classes: 2
  max_positions: 512

checkpoint:
  restore_file: ???
  reset_optimizer: true
  reset_dataloader: true
  reset_meters: true
  best_checkpoint_metric: accuracy
  maximize_best_checkpoint_metric: true

distributed_training:
  find_unused_parameters: true
  distributed_world_size: 1

criterion:
  _name: sentence_prediction

dataset:
  batch_size: 16
  required_batch_size_multiple: 1
  max_tokens: 4400

optimizer:
  _name: adam
  weight_decay: 0.1
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 137

optimization:
  clip_norm: 0.0
  lr: [1e-05]
  max_update: 2296
  max_epoch: 10

model:
  _name: roberta_large
  dropout: 0.1
  attention_dropout: 0.1
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# @package _group_

common:
  fp16: true
  fp16_init_scale: 4
  threshold_loss_scale: 1
  fp16_scale_window: 128

task:
  _name: sentence_prediction
  data: ???
  init_token: 0
  separator_token: 2
  num_classes: 2
  max_positions: 512

checkpoint:
  restore_file: ???
  reset_optimizer: true
  reset_dataloader: true
  reset_meters: true
  best_checkpoint_metric: accuracy
  maximize_best_checkpoint_metric: true

distributed_training:
  find_unused_parameters: true
  distributed_world_size: 1

criterion:
  _name: sentence_prediction

dataset:
  batch_size: 32
  required_batch_size_multiple: 1
  max_tokens: 4400

optimizer:
  _name: adam
  weight_decay: 0.1
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 1986

optimization:
  clip_norm: 0.0
  lr: [1e-05]
  max_update: 33112
  max_epoch: 10

model:
  _name: roberta_large
  dropout: 0.1
  attention_dropout: 0.1
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# @package _group_

common:
  fp16: true
  fp16_init_scale: 4
  threshold_loss_scale: 1
  fp16_scale_window: 128

task:
  _name: sentence_prediction
  data: ???
  init_token: 0
  separator_token: 2
  num_classes: 2
  max_positions: 512

checkpoint:
  restore_file: ???
  reset_optimizer: true
  reset_dataloader: true
  reset_meters: true
  best_checkpoint_metric: accuracy
  maximize_best_checkpoint_metric: true

distributed_training:
  find_unused_parameters: true
  distributed_world_size: 1

criterion:
  _name: sentence_prediction

dataset:
  batch_size: 32
  required_batch_size_multiple: 1
  max_tokens: 4400

optimizer:
  _name: adam
  weight_decay: 0.1
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 28318

optimization:
  clip_norm: 0.0
  lr: [1e-05]
  max_update: 113272
  max_epoch: 10

model:
  _name: roberta_large
  dropout: 0.1
  attention_dropout: 0.1
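
This last config matches the QQP column of the removed table. The one GLUE task that needs different handling is `STS-B`, a regression task: the old README added `--regression-target --best-checkpoint-metric loss`, removed `--maximize-best-checkpoint-metric`, and used a single class. Under the new setup those adjustments would presumably become config values or command-line overrides; a heavily hedged sketch, since the STS-B config is not shown in this excerpt and the `sts_b` config name plus the exact field names are assumptions that merely mirror the old flags:

```bash
# Assumed mapping of the old STS-B flags onto Hydra overrides: the field names mirror
# the old CLI options, and the `sts_b` config name / STS-B-bin data dir are hypothetical.
CUDA_VISIBLE_DEVICES=0 fairseq-hydra-train --config-dir examples/roberta/config/finetuning \
    --config-name sts_b \
    task.data=STS-B-bin \
    checkpoint.restore_file=$ROBERTA_PATH \
    task.num_classes=1 \
    task.regression_target=true \
    checkpoint.best_checkpoint_metric=loss \
    checkpoint.maximize_best_checkpoint_metric=false
```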
