
Commit 6aa4a70

burtenshaw, qgallouedec, and mlabonne authored
Fix Colab notebook in GRPO release (#806)
* [CHAPTER] Add chapter 12 on fine-tuning reasoning models (#799)
* add basic introduction for students
* page on the basics of RL
* add page on GRPO and the DeepSeek paper
* add GRPO in TRL page
* add coming soon page
* add ungraded quizzes on RL and R1
* update TOC
* format code snippets
* fix images in RL page
* fix and make correct GRPO section in RL page (2.mdx)
* fix preference data mistake
* fix use of 'preference data' in GRPO paper walkthrough
* improve DPO and PPO comparison in RL page (2.mdx)
* respond to feedback on TRL page
* add pseudocode and limitations to the GRPO paper page
* expand GRPO comparison in RL page
* add examples of dummy reward functions to TRL page
* add length function examples to TRL page
* format code snippets
* Fix all GRPO acronyms
* remove mention of preference datasets
* Apply suggestions from code review (Co-authored-by: Quentin Gallouédec <[email protected]>)
* respond to reviews in RL page
* remove unclear paragraph in paper page
* [DEMO] add interactive examples for GRPO reward functions (#800)
* add marimo example of length-based reward function with a slider
* move demo into TRL page
* experiment with marimo outside of prose
* update TOC with marimo example
* use marimo table for representation
* remove more snippet returns
* try with simple strings
* drop styling from marimo box
* try pure iframe component
* try without return values
* fall back to hello-world marimo example
* try snippet after marimo
* define marimo in Python script
* add real marimo example
* add real marimo example with length reward
* hide code and headers for tidiness
* add markdown for explanation
* add markdown for explanation
* move markdown up
* fix missing slider
* add notebooks to real locations
* remove experimentation page
* use correct URLs and add TODOs
* update all image links due to Hub org rename
* fix query params in notebook URLs
* reorder images to match prose
* add notebook exercise (Co-authored-by: Maxime Labonne <[email protected]>)
* update the TOC
* change images in exercise
* give clearer definition of what students will learn
* add Inference Providers example
* add reference to the Open R1 implementation of GRPO
* add section on pushing the model to the Hub during training
* update marimo examples
* improve inference section in exercise
* move from chapter 13 to chapter 12
* update coming soon section with future releases
* use a table for unit releases
* switch the R1 demo to an iframe and link
* make the marimo boxes longer
* add Maxime's name to the unit
* use released version of marimo notebooks

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Maxime Labonne <[email protected]>

* fix reference to Colab notebook chapter13

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Maxime Labonne <[email protected]>
1 parent 675c761 commit 6aa4a70

File tree

1 file changed (+1, -2 lines)

  • chapters/en/chapter12/5.mdx


chapters/en/chapter12/5.mdx

Lines changed: 1 addition & 2 deletions
@@ -1,7 +1,7 @@
 <CourseFloatingBanner chapter={2}
   classNames="absolute z-10 right-0 top-0"
   notebooks={[
-    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/main/course/en/chapter12/grpo_finetune.ipynb"},
+    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/main/course/en/chapter13/grpo_finetune.ipynb"},
 ]} />
 
 # Practical Exercise: Fine-tune a model with GRPO
@@ -163,7 +163,6 @@ As you can see, the reward from the reward function moves closer to 0 as the mod
 
 ![Reward from reward function](https://huggingface.co/reasoning-course/images/resolve/main/grpo/13.png)
 
-<!-- @qgallouedec @mlabonne could you review this section please!? -->
 You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and is directly related to the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence, i.e. how far the current policy has drifted from the original policy. As training progresses, the model learns to generate text that better matches the reward function, causing it to diverge more from its initial policy. This increasing divergence is reflected in the rising loss value, which actually indicates that the model is successfully adapting to optimize for the reward function.
 
 ![Loss](https://huggingface.co/reasoning-course/images/resolve/main/grpo/14.png)
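To make the loss behavior described in that paragraph concrete, here is a minimal sketch of a simplified per-token GRPO loss with a KL penalty. The names (`grpo_loss`, `beta`, the k3-style KL estimator) are illustrative assumptions for this example rather than the exact TRL implementation; the point is only to show why the loss is near zero at the start (the policy equals the reference, the KL term is zero, and group-relative advantages average to zero) and grows as the policy drifts from the reference.

```python
import torch

def grpo_loss(logps, ref_logps, advantages, beta=0.04):
    # Probability ratio between the current policy and a detached copy of it;
    # on the first optimizer step the two coincide, so the ratio is exactly 1.
    ratio = torch.exp(logps - logps.detach())
    # k3-style estimator of KL(policy || reference); zero when logps == ref_logps.
    kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
    # Per-token objective: advantage-weighted term minus the KL penalty,
    # negated so that it is minimized.
    per_token_loss = -(ratio * advantages - beta * kl)
    return per_token_loss.mean()

# Group-relative advantages are normalized to zero mean, so at initialization
# (logps == ref_logps, hence kl == 0) the loss is ~0. As training moves the
# policy away from the reference, the beta * kl term grows and the loss rises.
logps = torch.tensor([-1.2, -0.8, -1.0])
advantages = torch.tensor([0.5, -0.5, 0.0])
print(grpo_loss(logps, logps.clone(), advantages))  # ~0 at the start of training
```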
