Commit a379e9c

Merge pull request #970 from huggingface/improve-unit3-rerelease
Improve unit3 for rerelease
2 parents 341e623 + c02318e commit a379e9c

9 files changed: +1314 -603 lines changed

chapters/en/_toctree.yml

Lines changed: 4 additions & 3 deletions
@@ -58,13 +58,14 @@
   - local: chapter3/2
     title: Processing the data
   - local: chapter3/3
-    title: Fine-tuning a model with the Trainer API or Keras
-    local_fw: { pt: chapter3/3, tf: chapter3/3_tf }
+    title: Fine-tuning a model with the Trainer API
   - local: chapter3/4
-    title: A full training
+    title: A full training loop
   - local: chapter3/5
     title: Fine-tuning, Check!
   - local: chapter3/6
+    title: Understanding Learning Curves
+  - local: chapter3/7
     title: End-of-chapter quiz
     quiz: 3

chapters/en/chapter3/1.mdx

Lines changed: 28 additions & 13 deletions
@@ -7,20 +7,35 @@
   classNames="absolute z-10 right-0 top-0"
 />

-In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model for your own dataset? That's the topic of this chapter! You will learn:
+In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model to solve a specific task? That's the topic of this chapter! You will learn:

-{#if fw === 'pt'}
-* How to prepare a large dataset from the Hub
-* How to use the high-level `Trainer` API to fine-tune a model
-* How to use a custom training loop
-* How to leverage the 🤗 Accelerate library to easily run that custom training loop on any distributed setup
+* How to prepare a large dataset from the Hub using the latest 🤗 Datasets features
+* How to use the high-level `Trainer` API to fine-tune a model with modern best practices
+* How to implement a custom training loop with optimization techniques
+* How to leverage the 🤗 Accelerate library to easily run distributed training on any setup
+* How to apply current fine-tuning best practices for maximum performance

-{:else}
-* How to prepare a large dataset from the Hub
-* How to use Keras to fine-tune a model
-* How to use Keras to get predictions
-* How to use a custom metric
+<Tip>

-{/if}
+📚 **Essential Resources**: Before starting, you might want to review the [🤗 Datasets documentation](https://huggingface.co/docs/datasets/) for data processing.

-In order to upload your trained checkpoints to the Hugging Face Hub, you will need a huggingface.co account: [create an account](https://huggingface.co/join)
+</Tip>
+
+This chapter will also serve as an introduction to some Hugging Face libraries beyond the 🤗 Transformers library! We'll see how libraries like 🤗 Datasets, 🤗 Tokenizers, 🤗 Accelerate, and 🤗 Evaluate can help you train models more efficiently and effectively.
+
+Each of the main sections in this chapter will teach you something different:
+- **Section 2**: Learn modern data preprocessing techniques and efficient dataset handling
+- **Section 3**: Master the powerful Trainer API with all its latest features
+- **Section 4**: Implement training loops from scratch and understand distributed training with Accelerate
+
+By the end of this chapter, you'll be able to fine-tune models on your own datasets using both high-level APIs and custom training loops, applying the latest best practices in the field.
+
+<Tip>
+
+🎯 **What You'll Build**: By the end of this chapter, you'll have fine-tuned a BERT model for text classification and understand how to adapt the techniques to your own datasets and tasks.
+
+</Tip>
+
+This chapter focuses exclusively on **PyTorch**, as it has become the standard framework for modern deep learning research and production. We'll use the latest APIs and best practices from the Hugging Face ecosystem.
+
+To upload your trained models to the Hugging Face Hub, you will need a Hugging Face account: [create an account](https://huggingface.co/join)

chapters/en/chapter3/2.mdx

Lines changed: 161 additions & 101 deletions
Large diffs are not rendered by default.

chapters/en/chapter3/3.mdx

Lines changed: 228 additions & 7 deletions
@@ -11,7 +11,13 @@

 <Youtube id="nvBXf7s7vTI"/>

-🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/).
+🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset with modern best practices. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/).
+
+<Tip>
+
+📚 **Training Resources**: Before diving into training, familiarize yourself with the comprehensive [🤗 Transformers training guide](https://huggingface.co/docs/transformers/main/en/training) and explore practical examples in the [fine-tuning cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu).
+
+</Tip>

 The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need:
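Note: the recap the context line above refers to sits outside this hunk, so the diff does not render it. A sketch of what it covers, assuming it matches section 2 of the chapter (the dataset, checkpoint, and variable names are taken from there, not from this commit):

```py
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Assumed setup from section 2: the MRPC dataset from GLUE, a BERT checkpoint,
# tokenized sentence pairs, and a padding data collator.
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"  # assumption: the checkpoint used in section 2
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```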

@@ -42,9 +48,11 @@ from transformers import TrainingArguments
 training_args = TrainingArguments("test-trainer")
 ```

+If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in [Chapter 4](/course/chapter4/3)
+
 <Tip>

-💡 If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in [Chapter 4](/course/chapter4/3)
+🚀 **Advanced Configuration**: For detailed information on all available training arguments and optimization strategies, check out the [TrainingArguments documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) and the [training configuration cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu).

 </Tip>
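Note: the relocated sentence above only names the flag. A minimal sketch of how `push_to_hub=True` might be combined with the `TrainingArguments` call shown in this hunk (the extra arguments are illustrative, not part of the commit):

```py
from transformers import TrainingArguments

# Illustrative combination: "test-trainer" is both the output directory and the
# default Hub repository name; push_to_hub=True uploads checkpoints during training.
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    push_to_hub=True,
)
```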

@@ -58,7 +66,7 @@ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_label

 You will notice that unlike in [Chapter 2](/course/chapter2), you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.

-Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class` (e.g., a tokenizer, feature extractor, or processor):
+Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class`. The `processing_class` parameter is a newer addition that tells the Trainer which tokenizer to use for processing:

 ```py
 from transformers import Trainer
@@ -73,7 +81,13 @@ trainer = Trainer(
 )
 ```

-Note that when you pass a tokenizer as the `processing_class`, as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding` if the `processing_class` is a tokenizer or feature extractor, so you can skip the line `data_collator=data_collator` in this call. It was still important to show you this part of the processing in section 2!
+When you pass a tokenizer as the `processing_class`, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding`. You can skip the `data_collator=data_collator` line in this case, but we included it here to show you this important part of the processing pipeline.
+
+<Tip>
+
+📖 **Learn More**: For comprehensive details on the Trainer class and its parameters, visit the [Trainer API documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer) and explore advanced usage patterns in the [training cookbook recipes](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu).
+
+</Tip>

 To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`:
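Note: the two hunks above only show the opening and closing lines of the `Trainer` construction. A sketch of the full call they describe, assuming the variable names used earlier in the chapter (`tokenized_datasets`, `data_collator`, `tokenizer`), which are not rendered in this diff:

```py
from transformers import Trainer

# Assumed full construction: datasets, collator, and tokenizer come from the
# preprocessing section; processing_class tells the Trainer which tokenizer to use.
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

trainer.train()
```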

@@ -123,6 +137,12 @@ metric.compute(predictions=preds, references=predictions.label_ids)
 {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
 ```

+<Tip>
+
+Learn about different evaluation metrics and strategies in the [🤗 Evaluate documentation](https://huggingface.co/docs/evaluate/).
+
+</Tip>
+
 The exact results you get may vary, as the random initialization of the model head might change the metrics it achieved. Here, we can see our model has an accuracy of 85.78% on the validation set and an F1 score of 89.97. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an F1 score of 88.9 for the base model. That was the `uncased` model while we are currently using the `cased` model, which explains the better result.

 Wrapping everything together, we get our `compute_metrics()` function:
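Note: the `compute_metrics()` function itself falls outside this hunk and is not rendered. Assuming it follows the standard GLUE/MRPC recipe used earlier in the chapter, it is roughly:

```py
import numpy as np
import evaluate

# Sketch of the compute_metrics() function referenced above (not shown in this
# diff): convert logits to class predictions, then score them with the MRPC metric.
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```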
@@ -160,13 +180,214 @@ trainer.train()

 This time, it will report the validation loss and metrics at the end of each epoch on top of the training loss. Again, the exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark.

-The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16 = True` in your training arguments). We will go over everything it supports in Chapter 10.
+### Advanced Training Features[[advanced-training-features]]
+
+The `Trainer` comes with many built-in features that make modern deep learning best practices accessible:
+
+**Mixed Precision Training**: Use `fp16=True` in your training arguments for faster training and reduced memory usage:
+
+```py
+training_args = TrainingArguments(
+    "test-trainer",
+    eval_strategy="epoch",
+    fp16=True,  # Enable mixed precision
+)
+```
+
+**Gradient Accumulation**: For effective larger batch sizes when GPU memory is limited:
+
+```py
+training_args = TrainingArguments(
+    "test-trainer",
+    eval_strategy="epoch",
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
+)
+```
+
+**Learning Rate Scheduling**: The Trainer uses linear decay by default, but you can customize this:
+
+```py
+training_args = TrainingArguments(
+    "test-trainer",
+    eval_strategy="epoch",
+    learning_rate=2e-5,
+    lr_scheduler_type="cosine",  # Try different schedulers
+)
+```
+
+<Tip>
+
+🎯 **Performance Optimization**: For more advanced training techniques including distributed training, memory optimization, and hardware-specific optimizations, explore the [🤗 Transformers performance guide](https://huggingface.co/docs/transformers/main/en/performance).
+
+</Tip>
+
+The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options for distributed training. We will go over everything it supports in Chapter 10.
+
+This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7), but for now let's look at how to do the same thing with a pure PyTorch training loop.
+
+<Tip>
+
+📝 **More Examples**: Check out the comprehensive collection of [🤗 Transformers notebooks](https://huggingface.co/docs/transformers/main/en/notebooks).
+
+</Tip>

-This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7), but for now let's look at how to do the same thing in pure PyTorch.
+## Section Quiz[[section-quiz]]
+
+Test your understanding of the Trainer API and fine-tuning concepts:
+
+### 1. What is the purpose of the <code>processing_class</code> parameter in the Trainer?
+
+<Question
+  choices={[
+    {
+      text: "It specifies which model architecture to use.",
+      explain: "Model architecture is specified when loading the model, not in the Trainer."
+    },
+    {
+      text: "It tells the Trainer which tokenizer to use for processing data.",
+      explain: "The processing_class parameter is a modern addition that helps the Trainer know which tokenizer to use.",
+      correct: true
+    },
+    {
+      text: "It determines the batch size for training.",
+      explain: "Batch size is set in TrainingArguments, not through processing_class."
+    },
+    {
+      text: "It controls the evaluation frequency.",
+      explain: "Evaluation frequency is controlled by eval_strategy in TrainingArguments."
+    }
+  ]}
+/>
+
+### 2. Which TrainingArguments parameter controls how often evaluation occurs during training?
+
+<Question
+  choices={[
+    {
+      text: "eval_frequency",
+      explain: "There's no eval_frequency parameter in TrainingArguments."
+    },
+    {
+      text: "eval_strategy",
+      explain: "eval_strategy can be set to 'epoch', 'steps', or 'no' to control evaluation timing.",
+      correct: true
+    },
+    {
+      text: "evaluation_steps",
+      explain: "eval_steps sets the number of steps between evaluations, but eval_strategy determines if/when evaluation happens."
+    },
+    {
+      text: "do_eval",
+      explain: "There's no do_eval parameter in modern TrainingArguments."
+    }
+  ]}
+/>
+
+### 3. What does <code>fp16=True</code> in TrainingArguments enable?
+
+<Question
+  choices={[
+    {
+      text: "16-bit integer precision for faster training.",
+      explain: "fp16 refers to floating-point precision, not integer precision."
+    },
+    {
+      text: "Mixed precision training with 16-bit floating-point numbers for faster training and reduced memory usage.",
+      explain: "Mixed precision training uses 16-bit floats for forward pass and 32-bit for gradients, improving speed and reducing memory usage.",
+      correct: true
+    },
+    {
+      text: "Training for exactly 16 epochs.",
+      explain: "fp16 has nothing to do with the number of epochs."
+    },
+    {
+      text: "Using 16 GPUs for distributed training.",
+      explain: "The number of GPUs is not controlled by the fp16 parameter."
+    }
+  ]}
+/>
+
+### 4. What is the role of the <code>compute_metrics</code> function in the Trainer?
+
+<Question
+  choices={[
+    {
+      text: "It calculates the loss during training.",
+      explain: "Loss calculation is handled automatically by the model, not by compute_metrics."
+    },
+    {
+      text: "It converts logits to predictions and calculates evaluation metrics like accuracy and F1.",
+      explain: "compute_metrics takes predictions and labels, then returns metrics for evaluation.",
+      correct: true
+    },
+    {
+      text: "It determines which optimizer to use.",
+      explain: "Optimizer selection is not handled by compute_metrics."
+    },
+    {
+      text: "It preprocesses the training data.",
+      explain: "Data preprocessing is done before training, not by compute_metrics during evaluation."
+    }
+  ]}
+/>
+
+### 5. What happens when you don't provide an <code>eval_dataset</code> to the Trainer?
+
+<Question
+  choices={[
+    {
+      text: "Training will fail with an error.",
+      explain: "Training can proceed without an eval_dataset, though you won't get evaluation metrics."
+    },
+    {
+      text: "The Trainer will automatically split the training data for evaluation.",
+      explain: "The Trainer doesn't automatically create validation splits."
+    },
+    {
+      text: "You won't get evaluation metrics during training, but training will still work.",
+      explain: "Evaluation is optional - you can train without it, but you won't see validation metrics.",
+      correct: true
+    },
+    {
+      text: "The model will use the training data for evaluation.",
+      explain: "The Trainer won't automatically use training data for evaluation - it simply won't evaluate."
+    }
+  ]}
+/>
+
+### 6. What is gradient accumulation and how do you enable it?
+
+<Question
+  choices={[
+    {
+      text: "It saves gradients to disk, enabled with save_gradients=True.",
+      explain: "Gradient accumulation doesn't involve saving gradients to disk."
+    },
+    {
+      text: "It accumulates gradients over multiple batches before updating, enabled with gradient_accumulation_steps.",
+      explain: "This allows you to simulate larger batch sizes by accumulating gradients over multiple forward passes.",
+      correct: true
+    },
+    {
+      text: "It speeds up gradient computation, enabled automatically with fp16.",
+      explain: "While fp16 can speed up training, gradient accumulation is a separate technique."
+    },
+    {
+      text: "It prevents gradient overflow, enabled with gradient_clipping=True.",
+      explain: "That describes gradient clipping, not gradient accumulation."
+    }
+  ]}
+/>

 <Tip>

-✏️ **Try it out!** Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2.
+💡 **Key Takeaways:**
+- The `Trainer` API provides a high-level interface that handles most training complexity
+- Use `processing_class` to specify your tokenizer for proper data handling
+- `TrainingArguments` controls all aspects of training: learning rate, batch size, evaluation strategy, and optimizations
+- `compute_metrics` enables custom evaluation metrics beyond just training loss
+- Modern features like mixed precision (`fp16=True`) and gradient accumulation can significantly improve training efficiency

 </Tip>
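Note: as a closing illustration, the options introduced across this file can be wired together as follows. This consolidation is not part of the commit; every name other than the `TrainingArguments`/`Trainer` parameters is assumed from earlier sections of the chapter:

```py
from transformers import Trainer, TrainingArguments

# Illustrative only: combines the options shown in the hunks above
# (mixed precision, gradient accumulation, scheduler, Hub upload).
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    lr_scheduler_type="cosine",
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```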
