6 | 6 |
7 | 7 | # Practical Exercise: GRPO with Unsloth
8 | 8 |
9 | | -In this exercise, you'll fine-tune a model with GRPO (Group Relative Policy Optimization) using Unsloth, to improve a model's reasoning capabilities. We covered GRPO in [Chapter 3](/en/chapter3/3).
| 9 | +In this exercise, you'll fine-tune a model with GRPO (Group Relative Policy Optimization) using Unsloth, to improve a model's reasoning capabilities. We covered GRPO in [Chapter 3](/course/chapter3/3).
10 | 10 |
11 | 11 | Unsloth is a library that accelerates LLM fine-tuning, making it possible to train models faster and with fewer computational resources. Unsloth plugs into TRL, so we'll build on what we learned in the previous sections and adapt it for Unsloth's specifics.
12 | 12 |
13 | 13 |
14 | 14 | <Tip>
| 15 | +
15 | 16 | This exercise can be run on a free Google Colab T4 GPU. For the best experience, follow along with the notebook linked above and try it out yourself.
| 17 | +
16 | 18 | </Tip>
17 | 19 |
18 | 20 | ## Install dependencies
@@ -72,7 +74,7 @@ This code loads the model in 4-bit quantization to save memory and applies LoRA
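The code this hunk refers to is collapsed in the diff. As a rough sketch of the idea only (the model name and hyperparameters below are illustrative assumptions, not necessarily what the notebook uses), loading a model in 4-bit with Unsloth and attaching LoRA adapters typically looks like this:

```python
from unsloth import FastLanguageModel

# Illustrative choices: swap in whichever base model and settings the notebook uses.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # assumed model, small enough for a T4
    max_seq_length=1024,
    load_in_4bit=True,  # 4-bit quantization keeps memory usage low
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # reduces activation memory
)
```

Because only the LoRA adapters are trained on top of a quantized base model, the whole setup fits on a free Colab T4.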
72 | 74 |
73 | 75 | <Tip>
74 | 76 |
75 | | -We won't cover the details of LoRA in this chapter, but you can learn more in [Chapter 11](/en/chapter11/3).
| 77 | +We won't cover the details of LoRA in this chapter, but you can learn more in [Chapter 11](/course/chapter11/3).
76 | 78 |
77 | 79 | </Tip>
78 | 80 |
@@ -146,7 +148,7 @@ The dataset is prepared by extracting the answer from the dataset and formatting
146 | 148 |
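The dataset-preparation code sits outside this hunk. Below is a minimal sketch of the step it describes, assuming a GSM8K-style dataset where the reference answer follows a `####` marker; the dataset id, system prompt, and function names are illustrative rather than taken from the notebook:

```python
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def extract_hash_answer(text: str):
    # GSM8K-style answers end with "#### <final answer>".
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def get_questions(split: str = "train"):
    data = load_dataset("openai/gsm8k", "main")[split]
    # Each example becomes a chat-style prompt plus the extracted reference answer.
    return data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )

dataset = get_questions()
```

The extracted `answer` column is what correctness-style reward functions can later compare completions against.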
147 | 149 | ## Defining Reward Functions
148 | 150 |
149 | | -As we discussed in [an earlier page](/en/chapter13/4), GRPO can use reward functions to guide the model's learning based on verifiable criteria like length and formatting.
| 151 | +As we discussed in [an earlier page](/course/chapter13/4), GRPO can use reward functions to guide the model's learning based on verifiable criteria like length and formatting.
150 | 152 |
151 | 153 | In this exercise, we'll define several reward functions that encourage different aspects of good reasoning. For example, we'll reward the model for providing an integer answer and for following the strict format.
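The notebook's own reward functions appear further down in the file; the two below are only a sketch of the two examples mentioned here (an integer-answer reward and a strict-format reward), written against TRL's reward-function interface, where each completion arrives as a list of chat messages. The regex patterns and reward values are illustrative.

```python
import re

def int_reward_func(completions, **kwargs):
    """Reward completions whose <answer> block contains a plain integer."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [
        r.split("<answer>")[-1].split("</answer>")[0].strip() for r in responses
    ]
    return [0.5 if text.lstrip("-").isdigit() else 0.0 for text in extracted]

def strict_format_reward_func(completions, **kwargs):
    """Reward completions that follow the full <reasoning>/<answer> template exactly."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s*$"
    responses = [completion[0]["content"] for completion in completions]
    return [
        0.5 if re.match(pattern, r, flags=re.DOTALL) else 0.0 for r in responses
    ]
```

Functions like these are passed to the trainer as a list via the `reward_funcs` argument, typically alongside a correctness reward that checks the extracted answer against the dataset's `answer` column.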
152 | 154 |
@@ -221,7 +223,7 @@ These reward functions serve different purposes:
221 | 223 |
222 | 224 | ## Training with GRPO
223 | 225 |
224 | | -Now we'll set up the GRPO trainer with our model, tokenizer, and reward functions. This part follows the same approach as the [previous exercise](/en/chapter12/5).
| 226 | +Now we'll set up the GRPO trainer with our model, tokenizer, and reward functions. This part follows the same approach as the [previous exercise](/course/chapter12/5).
225 | 227 |
226 | 228 | ```python
227 | 229 | from trl import GRPOConfig, GRPOTrainer
@@ -278,7 +280,9 @@ trainer.train()
278 | 280 | ```
279 | 281 |
280 | 282 | <Tip warning={true}>
| 283 | +
281 | 284 | Training may take some time. You might not see rewards increase immediately; it can take 150-200 steps before you start seeing improvements. Be patient!
| 285 | +
282 | 286 | </Tip>
283 | 287 |
284 | 288 | ## Testing the Model
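The testing code lies beyond the end of this diff. As a hedged sketch of how you might sanity-check the trained model (the question text is made up, and `SYSTEM_PROMPT` is the illustrative prompt from the dataset sketch above), Unsloth can be switched into inference mode and queried through the usual chat template:

```python
from unsloth import FastLanguageModel

# `model` and `tokenizer` are the objects used during training above.
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "A bakery sells 14 cupcakes per tray. How many cupcakes are on 6 trays?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids=input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

If training worked, the reply should follow the `<reasoning>`/`<answer>` format that the reward functions encourage.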