6 | 6 |
7 | 7 | # Practical Exercise: GRPO with Unsloth
8 | 8 |
9 | | -In this exercise, you'll fine-tune a model with GRPO (Group Relative Policy Optimization) using Unsloth, to improve a model's reasoning capabilities. We covered GRPO in [Chapter 3](/en/chapter3/3).
| 9 | +In this exercise, you'll fine-tune a model with GRPO (Group Relative Policy Optimization) using Unsloth, to improve a model's reasoning capabilities. We covered GRPO in [Chapter 3](/course/chapter3/3).
10 | 10 |
11 | 11 | Unsloth is a library that accelerates LLM fine-tuning, making it possible to train models faster and with fewer computational resources. Unsloth plugs into TRL, so we'll build on what we learned in the previous sections and adapt it for Unsloth's specifics.
12 | 12 |
13 | 13 |
14 | 14 | <Tip>
| 15 | +
15 | 16 | This exercise can be run on a free Google Colab T4 GPU. For the best experience, follow along with the notebook linked above and try it out yourself.
| 17 | +
16 | 18 | </Tip>
17 | 19 |
18 | 20 | ## Install dependencies
@@ -72,7 +74,7 @@ This code loads the model in 4-bit quantization to save memory and applies LoRA
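The code this hunk refers to is collapsed in the diff. As a rough sketch of the idea only (the model name and hyperparameters below are illustrative assumptions, not necessarily what the notebook uses), loading a model in 4-bit with Unsloth and attaching LoRA adapters typically looks like this:

```python
from unsloth import FastLanguageModel

# Illustrative choices: swap in whichever base model and settings the notebook uses.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # assumed model, small enough for a T4
    max_seq_length=1024,
    load_in_4bit=True,  # 4-bit quantization keeps memory usage low
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # reduces activation memory
)
```

Because only the LoRA adapters are trained on top of a quantized base model, the whole setup fits on a free Colab T4.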
72 | 74 |
73 | 75 | <Tip>
74 | 76 |
75 | | -We won't cover the details of LoRA in this chapter, but you can learn more in [Chapter 11](/en/chapter11/3).
| 77 | +We won't cover the details of LoRA in this chapter, but you can learn more in [Chapter 11](/course/chapter11/3).
76 | 78 |
77 | 79 | </Tip>
78 | 80 |
@@ -146,7 +148,7 @@ The dataset is prepared by extracting the answer from the dataset and formatting
146 | 148 |
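The dataset-preparation code sits outside this hunk. Below is a minimal sketch of the step it describes, assuming a GSM8K-style dataset where the reference answer follows a `####` marker; the dataset id, system prompt, and function names are illustrative rather than taken from the notebook:

```python
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def extract_hash_answer(text: str):
    # GSM8K-style answers end with "#### <final answer>".
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def get_questions(split: str = "train"):
    data = load_dataset("openai/gsm8k", "main")[split]
    # Each example becomes a chat-style prompt plus the extracted reference answer.
    return data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )

dataset = get_questions()
```

The extracted `answer` column is what correctness-style reward functions can later compare completions against.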
147 | 149 | ## Defining Reward Functions
148 | 150 |
149 | | -As we discussed in [an earlier page](/en/chapter13/4), GRPO can use reward functions to guide the model's learning based on verifiable criteria like length and formatting.
| 151 | +As we discussed in [an earlier page](/course/chapter13/4), GRPO can use reward functions to guide the model's learning based on verifiable criteria like length and formatting.
150 | 152 |
151 | 153 | In this exercise, we'll define several reward functions that encourage different aspects of good reasoning. For example, we'll reward the model for providing an integer answer and for following the strict format.
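The notebook's own reward functions appear further down in the file; the two below are only a sketch of the two examples mentioned here (an integer-answer reward and a strict-format reward), written against TRL's reward-function interface, where each completion arrives as a list of chat messages. The regex patterns and reward values are illustrative.

```python
import re

def int_reward_func(completions, **kwargs):
    """Reward completions whose <answer> block contains a plain integer."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [
        r.split("<answer>")[-1].split("</answer>")[0].strip() for r in responses
    ]
    return [0.5 if text.lstrip("-").isdigit() else 0.0 for text in extracted]

def strict_format_reward_func(completions, **kwargs):
    """Reward completions that follow the full <reasoning>/<answer> template exactly."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s*$"
    responses = [completion[0]["content"] for completion in completions]
    return [
        0.5 if re.match(pattern, r, flags=re.DOTALL) else 0.0 for r in responses
    ]
```

Functions like these are passed to the trainer as a list via the `reward_funcs` argument, typically alongside a correctness reward that checks the extracted answer against the dataset's `answer` column.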
152 | 154 |
@@ -221,7 +223,7 @@ These reward functions serve different purposes:
221 | 223 |
222 | 224 | ## Training with GRPO
223 | 225 |
224 | | -Now we'll set up the GRPO trainer with our model, tokenizer, and reward functions. This part follows the same approach as the [previous exercise](/en/chapter12/5).
| 226 | +Now we'll set up the GRPO trainer with our model, tokenizer, and reward functions. This part follows the same approach as the [previous exercise](/course/chapter12/5).
225 | 227 |
226 | 228 | ```python
227 | 229 | from trl import GRPOConfig, GRPOTrainer
@@ -278,7 +280,9 @@ trainer.train()
278 | 280 | ```
279 | 281 |
280 | 282 | <Tip warning={true}>
| 283 | +
281 | 284 | Training may take some time. You might not see rewards increase immediately; it can take 150-200 steps before you start seeing improvements. Be patient!
| 285 | +
282 | 286 | </Tip>
283 | 287 |
284 | 288 | ## Testing the Model
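The testing code lies beyond the end of this diff. As a hedged sketch of how you might sanity-check the trained model (the question text is made up, and `SYSTEM_PROMPT` is the illustrative prompt from the dataset sketch above), Unsloth can be switched into inference mode and queried through the usual chat template:

```python
from unsloth import FastLanguageModel

# `model` and `tokenizer` are the objects used during training above.
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "A bakery sells 14 cupcakes per tray. How many cupcakes are on 6 trays?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids=input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

If training worked, the reply should follow the `<reasoning>`/`<answer>` format that the reward functions encourage.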