
Commit d4d64ee

lewtun, artyomboyko, MKhalusova, mariosasko, and pdumin authored
Bump release (#768)
* Update chapters/ru/chapter7/5.mdx There's that extra space again. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/5.mdx There's that extra space again that I didn't notice. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/5.mdx Extra space. Co-authored-by: Maria Khalusova <[email protected]>
* Update 5.mdx Translated the missing comment.
* Update chapters/ru/chapter7/4.mdx Extra space. Co-authored-by: Maria Khalusova <[email protected]>
* Update 2.mdx Translated the missing comment in the code.
* Update 2.mdx Translated the missing sentence.
* Update 3.mdx Translated the missing sentence.
* Update 3.mdx I agree, it sounds more neutral that way.
* Update chapters/ru/chapter7/3.mdx An unnecessary parenthesis. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/3.mdx Also an option, but we've translated it as "карточка модели" ("model card") in a lot of places. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/3.mdx Extra space. Co-authored-by: Maria Khalusova <[email protected]>
* Update 3.mdx Translated the missing comment in the code.
* Update chapters/ru/chapter7/3.mdx Extra space. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/4.mdx Extra space. Co-authored-by: Maria Khalusova <[email protected]>
* Update 4.mdx Translated the missing comment in the code.
* Update 5.mdx Added and translated the missing sentence: "Since the collator expects a list of dicts, where each dict represents a single example in the dataset, we also need to wrangle the data into the expected format before passing it to the data collator:"
* Update 5.mdx Edited the display of the table on the course page.
* Fixed links to other chapters.
* Fixed links to chapters' intros.
* Added myself to the Languages and translations table.
* Deleted unnecessary folder automatically created by JupyterLab.
* Fix links to HF docs.
* Finalizing the translation of chapter 7.
* Update 6.mdx Extra space.
* Update 7.mdx Extra space.
* Update chapters/ru/chapter7/6.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/6.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/6.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/7.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/6.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/7.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/7.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/8.mdx Correction of abbreviation: NLP. Co-authored-by: Maria Khalusova <[email protected]>
* Update 7.mdx Translated the code commentary.
* Update 6.mdx Translated the missing sentence.
* Update chapters/ru/chapter7/7.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update 6.mdx
* Update chapters/ru/chapter7/6.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/7.mdx Correcting a link. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter7/6.mdx Co-authored-by: Maria Khalusova <[email protected]>
* 8/1-2 done
* 8/3 finished
* 8/4 finished
* Fix typo
* TOC update
* Typos fixed
* Removed English text
* 8/5 finished
* 8/6-7 finished
* Fix and update TOC
* chapter8/1 fixed
* chapter8/2 fixed
* chapter8/3 fixed
* chapter8/4 fixed
* chapter8/5 fixed
* Fix title 8/5
* Fix title 8/5 in TOC
* Update _toctree.yml title 8. Co-authored-by: Maria Khalusova <[email protected]>
* Bump black (#671)
* Fix unexpected token in quiz
* 8/2 fixed
* 8/3 fixed
* 8/4_tf fixed
* Update 3b.mdx Fix typo.
* Added translation of chapter 9 and Course Events.
* Added translation of chapter 9 and Course Events.
* Update 5.mdx Fix typo.
* Update 7.mdx Fix typo.
* Update 10.mdx Fix typo.
* Update chapters/ru/chapter9/6.mdx OK. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter9/7.mdx OK. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter9/7.mdx OK. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx I agree. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Sorry, I was hasty and made an incorrect assumption))) Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/1.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/events/2.mdx Removing tag. I agree with you. Co-authored-by: Maria Khalusova <[email protected]>
* Capture the current state of the translation. Two files are translated: , . Corrections to the table of contents.
* Made a full translation of Chapter 2.
* Fix problem in .
* Deleting JupyterLab backup files.
* Update 8.mdx Correcting problems in file.
* Update 8.mdx Translated a missing piece of text in English.
* Remove original sentence.
* Update chapters/ru/chapter2/2.mdx OK. Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/2.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/4.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/5.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/5.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/5.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/5.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/2.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update 2.mdx
* Update 2.mdx
* Update chapters/ru/chapter2/2.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update 2.mdx
* Update chapters/ru/chapter2/2.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/3.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update chapters/ru/chapter2/4.mdx Co-authored-by: Maria Khalusova <[email protected]>
* Update 4.mdx
* Minor edits to the table of contents, and to the titles of the final test files at the end of the chapter.
* Chapter 2's introduction, "hackable" French translation: in this context, the word "hackable" doesn't translate to "piratable" (which conveys the notion of illegal piracy). I believe that "modifiable" (i.e., can be modified) is more appropriate.
* Chapter 2.2, "doesn't make sense" French translation: in French, "doesn't make sense" translates to "n'a pas de sens".
* [doc-fix] Add accelerate required for notebook 3/3
* Translation fix chapter2 - section3: a mistake in the TensorFlow part, which used a PyTorch keyword.
* fix: typo
* Fix formatting
* Remove unnecessary translation
* Fix typo and formatting in Chinese translation
* Fix zh-CN translation for chapters 3-6
* Fix formatting issue for zh-CN in chapters 5-8
* Fix zh-CN translation in chapters 4-6
* docs: mdx typo
* Fix deactivate
* Tip to install datasets
* Fix translation
* Fix zh
* Fix zh
* Changing tokenized_dataset to tokenized_datasets. Signed-off-by: Jiri Podivin <[email protected]>
* Update 9.mdx Fix typo.
* Update 2.mdx Fix typo.
* [zh-CN/TW] Bugfix of broken image link: Companies using Hugging Face
* [zh-CN/TW] Pipeline name translation is not needed
* [zh-CN/TW] Translation of `Hub`: 集线器(集線器) ("hub device") => 模型中心 ("Model Hub")
* Correct zh translation
* Correct zh-TW translation
* Finish review of chapters 7, 8, 9. This year, my friends and I conducted a comprehensive review of the previously published Chinese translations, greatly improving their readability, in the hope that they can help more people. We have completed the second half of the first part, and are gradually reviewing the first half.
* Format fr/chapter9/4.mdx
* Update 7.mdx Add the missing closing tag.
* Create rum folder
* Add Steven as reviewer (#746)
* Update toctree
* Translate chapter0-1
* Update course0-1
* Remove files and folders that are not updated
* Add the rum folder in the build documentation
* Introduction to Argilla
* Set up Argilla
* Remove mention of chapters 10-12
* Finish review of ZH-CN chapters 1-6
* Code format for chapters 1-6
* Fixed wrong full-width colon
* Initial draft
* Fix
* Corrections section 2
* Section 3 improvements
* More improvements
* Images & apply review comments
* Apply suggestions from code review. Co-authored-by: vb <[email protected]>
* Fix style
* Updated images and banners
* More screenshots
* Fix quiz inline code
* More improvements from reviews
* Added chapter 0 and initiated _toctree for Nepali language
* Added Nepali language code in the GitHub Actions workflow
* Ran make styles without any errors!
* Update chapters/ne/chapter0/1.mdx Co-authored-by: Steven Liu <[email protected]>
* Update chapters/ne/chapter0/1.mdx Co-authored-by: Steven Liu <[email protected]>
* Made same codeblocks for activate and deactivate

---------

Signed-off-by: Jiri Podivin <[email protected]>
Co-authored-by: Artyom Boyko <[email protected]>
Co-authored-by: Maria Khalusova <[email protected]>
Co-authored-by: mariosasko <[email protected]>
Co-authored-by: Pavel <[email protected]>
Co-authored-by: Pavel <[email protected]>
Co-authored-by: Yanis ALLOUCH <[email protected]>
Co-authored-by: Fabrizio Damicelli <[email protected]>
Co-authored-by: Florent Flament <[email protected]>
Co-authored-by: Dmitrii Ioksha <[email protected]>
Co-authored-by: brealid <[email protected]>
Co-authored-by: Owen <[email protected]>
Co-authored-by: ruochenhua <[email protected]>
Co-authored-by: Jesse Zhang <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: tal7aouy <[email protected]>
Co-authored-by: buqieryu <[email protected]>
Co-authored-by: Jiri Podivin <[email protected]>
Co-authored-by: Taeyoon Kim <[email protected]>
Co-authored-by: Jks Liu <[email protected]>
Co-authored-by: huqian <[email protected]>
Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Tiezhen WANG <[email protected]>
Co-authored-by: Adam Molnar <[email protected]>
Co-authored-by: 1375626371 <[email protected]>
Co-authored-by: Hu Yaoqi <[email protected]>
Co-authored-by: eduard-balamatiuc <[email protected]>
Co-authored-by: Eduard Balamatiuc <[email protected]>
Co-authored-by: nataliaElv <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: Ann Huang <[email protected]>
Co-authored-by: Ann Huang <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
Co-authored-by: vb <[email protected]>
Co-authored-by: CRLannister <[email protected]>
Co-authored-by: Ashish Agarwal <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 40c51fb commit d4d64ee

File tree

148 files changed: +4439 −3574 lines


.github/workflows/build_documentation.yml

Lines changed: 1 addition & 1 deletion

@@ -14,6 +14,6 @@ jobs:
       package: course
       path_to_docs: course/chapters/
       additional_args: --not_python_module
-      languages: ar bn de en es fa fr gj he hi id it ja ko pt ru th tr vi zh-CN zh-TW
+      languages: ar bn de en es fa fr gj he hi id it ja ko ne pt ru rum th tr vi zh-CN zh-TW
     secrets:
       hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}

.github/workflows/build_pr_documentation.yml

Lines changed: 1 addition & 1 deletion

@@ -16,4 +16,4 @@ jobs:
       package: course
       path_to_docs: course/chapters/
       additional_args: --not_python_module
-      languages: ar bn de en es fa fr gj he hi id it ja ko pt ru th tr vi zh-CN zh-TW
+      languages: ar bn de en es fa fr gj he hi id it ja ko ne pt ru rum th tr vi zh-CN zh-TW

README.md

Lines changed: 1 addition & 1 deletion

@@ -110,7 +110,7 @@ pip install -r requirements.txt
 make style
 ```
 
-Once that's run, commit any changes, open a pull request, and tag [@lewtun](https://github.com/lewtun) for a review. Congratulations, you've now completed your first translation 🥳!
+Once that's run, commit any changes, open a pull request, and tag [@lewtun](https://github.com/lewtun) and [@stevhliu](https://github.com/stevhliu) for a review. If you also know other native-language speakers who are able to review the translation, tag them as well for help. Congratulations, you've now completed your first translation 🥳!
 
 > 🚨 To build the course on the website, double-check your language code exists in `languages` field of the `build_documentation.yml` and `build_pr_documentation.yml` files in the `.github` folder. If not, just add them in their alphabetical order.

chapters/ar/chapter0/1.mdx

Lines changed: 1 addition & 1 deletion

@@ -105,7 +105,7 @@ ls -a
 source .env/bin/activate
 
 # Deactivate the virtual environment
-source .env/bin/deactivate
+deactivate
 ```
 
 <div dir="rtl" style="direction:rtl;text-align:right;">

chapters/bn/chapter0/1.mdx

Lines changed: 1 addition & 1 deletion

@@ -87,7 +87,7 @@ ls -a
 source .env/bin/activate
 
 # virtual environment টি deactivate করার কমান্ড
-source .env/bin/deactivate
+deactivate
 ```
 
 `which python` কমান্ড চালিয়ে নিশ্চিত করতে পারেন যে virtual environment টি activate হয়েছে কিনা।

chapters/de/chapter0/1.mdx

Lines changed: 1 addition & 1 deletion

@@ -86,7 +86,7 @@ Mit den Skripten "activate" und "deactivate" kannst du in deine virtuelle Umgebu
 source .env/bin/activate
 
 # Deaktivieren der virtuellen Umgebung
-source .env/bin/deactivate
+deactivate
 ```
 
 Du kannst dich vergewissern, dass die Umgebung aktiviert ist, indem du den Befehl `which python` ausführst: Wenn er auf die virtuelle Umgebung verweist, dann hast du sie erfolgreich aktiviert!

chapters/de/chapter3/2.mdx

Lines changed: 3 additions & 0 deletions

@@ -87,6 +87,9 @@ In diesem Abschnitt verwenden wir den MRPC-Datensatz (Microsoft Research Paraphr
 Das Hub enthält nicht nur Modelle; es hat auch mehrere Datensätze in vielen verschiedenen Sprachen. Du kannst die Datensätze [hier](https://huggingface.co/datasets) durchsuchen, und wir empfehlen, einen weiteren Datensatz zu laden und zu verarbeiten, sobald Sie diesen Abschnitt abgeschlossen haben (die Dokumentation befindet sich [hier](https://huggingface.co/docs/datasets/loading)). Aber jetzt konzentrieren wir uns auf den MRPC-Datensatz! Dies ist einer der 10 Datensätze, aus denen sich das [GLUE-Benchmark](https://gluebenchmark.com/) zusammensetzt. Dies ist ein akademisches Benchmark, das verwendet wird, um die Performance von ML-Modellen in 10 verschiedenen Textklassifizierungsaufgaben zu messen.
 
 Die Bibliothek 🤗 Datasets bietet einen leichten Befehl zum Herunterladen und Caching eines Datensatzes aus dem Hub. Wir können den MRPC-Datensatz wie folgt herunterladen:
+<Tipp>
+⚠️ **Warnung** Stelle sicher, dass `datasets` installiert ist, indem du `pip install datasets` ausführst. Dann lade den MRPC-Datensatz und drucke ihn aus, um zu sehen, was er enthält.
+</Tipp>
 
 ```py
 from datasets import load_dataset

chapters/en/_toctree.yml

Lines changed: 20 additions & 0 deletions

@@ -191,6 +191,26 @@
     title: End-of-chapter quiz
     quiz: 9
 
+- title: 10. Curate high-quality datasets
+  new: true
+  subtitle: How to use Argilla to create amazing datasets
+  sections:
+  - local: chapter10/1
+    title: Introduction to Argilla
+  - local: chapter10/2
+    title: Set up your Argilla instance
+  - local: chapter10/3
+    title: Load your dataset to Argilla
+  - local: chapter10/4
+    title: Annotate your dataset
+  - local: chapter10/5
+    title: Use your annotated dataset
+  - local: chapter10/6
+    title: Argilla, check!
+  - local: chapter10/7
+    title: End-of-chapter quiz
+    quiz: 10
+
 - title: Course Events
   sections:
   - local: events/1

chapters/en/chapter0/1.mdx

Lines changed: 1 addition & 1 deletion

@@ -86,7 +86,7 @@ You can jump in and out of your virtual environment with the `activate` and `dea
 source .env/bin/activate
 
 # Deactivate the virtual environment
-source .env/bin/deactivate
+deactivate
 ```
 
 You can make sure that the environment is activated by running the `which python` command: if it points to the virtual environment, then you have successfully activated it!

chapters/en/chapter1/1.mdx

Lines changed: 1 addition & 1 deletion

@@ -23,7 +23,7 @@ Here is a brief overview of the course:
 
 - Chapters 1 to 4 provide an introduction to the main concepts of the 🤗 Transformers library. By the end of this part of the course, you will be familiar with how Transformer models work and will know how to use a model from the [Hugging Face Hub](https://huggingface.co/models), fine-tune it on a dataset, and share your results on the Hub!
 - Chapters 5 to 8 teach the basics of 🤗 Datasets and 🤗 Tokenizers before diving into classic NLP tasks. By the end of this part, you will be able to tackle the most common NLP problems by yourself.
-- Chapters 9 to 12 go beyond NLP, and explore how Transformer models can be used to tackle tasks in speech processing and computer vision. Along the way, you'll learn how to build and share demos of your models, and optimize them for production environments. By the end of this part, you will be ready to apply 🤗 Transformers to (almost) any machine learning problem!
+- Chapter 9 goes beyond NLP to cover how to build and share demos of your models on the 🤗 Hub. By the end of this part, you will be ready to showcase your 🤗 Transformers application to the world!
 
 This course:

chapters/en/chapter10/1.mdx

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+# Introduction to Argilla[[introduction-to-argilla]]
+
+<CourseFloatingBanner
+  chapter={10}
+  classNames="absolute z-10 right-0 top-0"
+/>
+
+In Chapter 5 you learnt how to build a dataset using the 🤗 Datasets library and in Chapter 6 you explored how to fine-tune models for some common NLP tasks. In this chapter, you will learn how to use [Argilla](https://argilla.io) to **annotate and curate datasets** that you can use to train and evaluate your models.
+
+The key to training models that perform well is to have high-quality data. Although there are some good datasets in the Hub that you could use to train and evaluate your models, these may not be relevant for your specific application or use case. In this scenario, you may want to build and curate a dataset of your own. Argilla will help you to do this efficiently.
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/signin-hf-page.png" alt="Argilla sign in page."/>
+
+With Argilla you can:
+
+- turn unstructured data into **structured data** to be used in NLP tasks.
+- curate a dataset to go from a low-quality dataset to a **high-quality dataset**.
+- gather **human feedback** for LLMs and multi-modal models.
+- invite experts to collaborate with you in Argilla, or crowdsource annotations!
+
+Here are some of the things that you will learn in this chapter:
+
+- How to set up your own Argilla instance.
+- How to load a dataset and configure it based on some popular NLP tasks.
+- How to use the Argilla UI to annotate your dataset.
+- How to use your curated dataset and export it to the Hub.
chapters/en/chapter10/2.mdx

Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+# Set up your Argilla instance[[set-up-your-argilla-instance]]
+
+<CourseFloatingBanner chapter={10}
+  classNames="absolute z-10 right-0 top-0"
+  notebooks={[
+    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter10/section2.ipynb"},
+    {label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter10/section2.ipynb"},
+]} />
+
+To start using Argilla, you will need to set up your own Argilla instance first. Then you will need to install the Python SDK so that you can manage Argilla using Python code.
+
+## Deploy the Argilla UI
+
+The easiest way to set up your Argilla instance is through Hugging Face Spaces. To create your Argilla Space, simply follow [this form](https://huggingface.co/new-space?template=argilla%2Fargilla-template-space). If you need further guidance, check the [Argilla quickstart](https://docs.argilla.io/latest/getting_started/quickstart/).
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/space_config.png" alt="Space configuration form."/>
+
+>[!WARNING]
+> ⚠️ You may want to enable **Persistent storage** so the data isn't lost if the Space is paused or restarted.
+> You can do that from the Settings of your Space.
+
+Once Argilla is up and running, you can log in with your credentials.
+
+## Install and connect the Python SDK
+
+Now you can go to your Python environment or notebook and install the argilla library:
+
+`!pip install argilla`
+
+Let's connect with our Argilla instance. To do that you will need the following information:
+
+- **Your API URL**: This is the URL where Argilla is running. If you are using a Space, you can open the Space, click on the three dots in the top right corner, then "Embed this Space" and copy the **Direct URL**. It should look something like `https://<your-username>-<space-name>.hf.space`.
+- **Your API key**: To get your key, log in to your Argilla instance and go to "My Settings", then copy the API key.
+- **Your HF token**: If your Space is private, you will need an Access Token from your Hugging Face Hub account with write permissions.
+
+```python
+import argilla as rg
+
+HF_TOKEN = "..."  # only for private spaces
+
+client = rg.Argilla(
+    api_url="...",
+    api_key="...",
+    headers={"Authorization": f"Bearer {HF_TOKEN}"},  # only for private spaces
+)
+```
+
+To check that everything is working properly, we'll call `me`. This should return our user:
+
+```python
+client.me
+```
+
+If this worked, your Argilla instance is up and running and you're connected to it! Congrats!
+
+We can now get started with loading our first dataset to Argilla.
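
As a usage note on the connection snippet in the diff above: rather than hardcoding credentials, you can read them from the environment. A minimal sketch, assuming the same `rg.Argilla` client shown in the added file (the `ARGILLA_API_URL` and `ARGILLA_API_KEY` variable names are illustrative, not an Argilla convention):

```python
import os

import argilla as rg

# Illustrative variable names; export them in your shell first, e.g.
#   export ARGILLA_API_URL="https://<your-username>-<space-name>.hf.space"
#   export ARGILLA_API_KEY="..."
client = rg.Argilla(
    api_url=os.environ["ARGILLA_API_URL"],
    api_key=os.environ["ARGILLA_API_KEY"],
)

print(client.me)  # prints your Argilla user if the connection works
```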

chapters/en/chapter10/3.mdx

Lines changed: 108 additions & 0 deletions

@@ -0,0 +1,108 @@
+# Load your dataset to Argilla[[load-your-dataset-to-argilla]]
+
+<CourseFloatingBanner chapter={10}
+  classNames="absolute z-10 right-0 top-0"
+  notebooks={[
+    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter10/section3.ipynb"},
+    {label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter10/section3.ipynb"},
+]} />
+
+Depending on the NLP task that you're working with and the specific use case or application, your data and the annotation task will look different. For this section of the course, we'll use [a dataset collecting news](https://huggingface.co/datasets/SetFit/ag_news) to complete two tasks: a text classification on the topic of each text and a token classification to identify the named entities mentioned.
+
+<iframe
+  src="https://huggingface.co/datasets/SetFit/ag_news/embed/viewer/default/train"
+  frameborder="0"
+  width="100%"
+  height="560px"
+></iframe>
+
+It is possible to import datasets from the Hub using the Argilla UI directly, but we'll be using the SDK to learn how we can make further edits to the data if needed.
+
+## Configure your dataset
+
+The first step is to connect to our Argilla instance as we did in the previous section:
+
+```python
+import argilla as rg
+
+HF_TOKEN = "..."  # only for private spaces
+
+client = rg.Argilla(
+    api_url="...",
+    api_key="...",
+    headers={"Authorization": f"Bearer {HF_TOKEN}"},  # only for private spaces
+)
+```
+
+We can now think about the settings of our dataset in Argilla. These represent the annotation task we'll do over our data. First, we can load the dataset from the Hub and inspect its features, so that we can make sure that we configure the dataset correctly.
+
+```python
+from datasets import load_dataset
+
+data = load_dataset("SetFit/ag_news", split="train")
+data.features
+```
+
+These are the features of our dataset:
+
+```python out
+{'text': Value(dtype='string', id=None),
+ 'label': Value(dtype='int64', id=None),
+ 'label_text': Value(dtype='string', id=None)}
+```
+
+It contains a `text` and also some initial labels for the text classification. We'll add those to our dataset settings together with a `spans` question for the named entities:
+
+```python
+settings = rg.Settings(
+    fields=[rg.TextField(name="text")],
+    questions=[
+        rg.LabelQuestion(
+            name="label", title="Classify the text:", labels=data.unique("label_text")
+        ),
+        rg.SpanQuestion(
+            name="entities",
+            title="Highlight all the entities in the text:",
+            labels=["PERSON", "ORG", "LOC", "EVENT"],
+            field="text",
+        ),
+    ],
+)
+```
+
+Let's dive a bit deeper into what these settings mean. First, we've defined **fields**; these include the information that we'll be annotating. In this case, we only have one field and it comes in the form of a text, so we've chosen a `TextField`.
+
+Then, we define **questions** that represent the tasks that we want to perform on our data:
+
+- For the text classification task we've chosen a `LabelQuestion`, and we used the unique values of the `label_text` column as our labels, to make sure that the question is compatible with the labels that already exist in the dataset.
+- For the token classification task, we'll need a `SpanQuestion`. We've defined a set of labels that we'll be using for that task, plus the field on which we'll be drawing the spans.
+
+To learn more about all the available types of fields and questions and other advanced settings, like metadata and vectors, go to the [Argilla docs](https://docs.argilla.io/latest/how_to_guides/dataset/#define-dataset-settings).
+
+## Upload the dataset
+
+Now that we've defined some settings, we can create the dataset:
+
+```python
+dataset = rg.Dataset(name="ag_news", settings=settings)
+
+dataset.create()
+```
+
+The dataset now appears in our Argilla instance, but you will see that it's empty:
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/empty_dataset.png" alt="Screenshot of the empty dataset."/>
+
+Now we need to add the records that we'll be annotating, i.e., the rows in our dataset. To do that, we'll simply need to log the data as records and provide a mapping for those elements that don't have the same name in the Hub and Argilla datasets:
+
+```python
+dataset.records.log(data, mapping={"label_text": "label"})
+```
+
+In our mapping, we've specified that the `label_text` column in the dataset should be mapped to the question with the name `label`. In this way, we'll use the existing labels in the dataset as pre-annotations so we can annotate faster.
+
+While the records continue to log, you can already start working with your dataset in the Argilla UI. At this point, it should look like this:
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_initial_dataset.png" alt="Screenshot of the dataset in Argilla."/>
+
+Now our dataset is ready to start annotating!
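
To sanity-check the upload in the diff above from code, here is a short sketch. It assumes the Argilla v2 SDK's `client.datasets(...)` accessor and the iterable `dataset.records` behave as named here; treat it as illustrative rather than canonical:

```python
import argilla as rg

client = rg.Argilla(api_url="...", api_key="...")

# Retrieve the dataset created above by its name (assumed v2 accessor)
dataset = client.datasets(name="ag_news")

# Inspect a few records to confirm that the log call and the mapping worked
for i, record in enumerate(dataset.records):
    print(record.fields["text"][:80])
    if i == 2:
        break
```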

chapters/en/chapter10/4.mdx

Lines changed: 44 additions & 0 deletions

@@ -0,0 +1,44 @@
+# Annotate your dataset[[annotate-your-dataset]]
+
+<CourseFloatingBanner
+  chapter={10}
+  classNames="absolute z-10 right-0 top-0"
+/>
+
+Now it is time to start working from the Argilla UI to annotate our dataset.
+
+## Align your team with annotation guidelines
+
+Before you start annotating your dataset, it is always good practice to write some guidelines, especially if you're working as part of a team. This will help you align on the task and the use of the different labels, and resolve questions or conflicts when they come up.
+
+In Argilla, you can go to your dataset settings page in the UI and modify the guidelines and the descriptions of your questions to help with alignment.
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_dataset_settings.png" alt="Screenshot of the Dataset Settings page in Argilla."/>
+
+If you want to dive deeper into the topic of how to write good guidelines, we recommend reading [this blogpost](https://argilla.io/blog/annotation-guidelines-practices) and the bibliographical references mentioned there.
+
+## Distribute the task
+
+In the dataset settings page, you can also change the dataset distribution settings. This will help you annotate more efficiently when you're working as part of a team. The default value for the minimum submitted responses is 1, meaning that as soon as a record has 1 submitted response it will be considered complete and count towards the progress in your dataset.
+
+Sometimes you want to have more than one submitted response per record, for example, if you want to analyze the inter-annotator agreement in your task. In that case, make sure to change this setting to a higher number, but always smaller than or equal to the total number of annotators. If you're working on the task alone, you want this setting to be 1.
+
+## Annotate records
+
+>[!TIP]
+> 💡 If you are deploying Argilla in a Hugging Face Space, any team members will be able to log in using the Hugging Face OAuth. Otherwise, you may need to create users for them following [this guide](https://docs.argilla.io/latest/how_to_guides/user/).
+
+When you open your dataset, you will realize that the first question is already filled in with some suggested labels. That's because in the previous section we mapped our question called `label` to the `label_text` column in the dataset, so that we simply need to review and correct the already existing labels:
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_initial_dataset.png" alt="Screenshot of the dataset in Argilla."/>
+
+For the token classification, we'll need to add all labels manually, as we didn't include any suggestions. This is how it might look after the span annotations:
+
+<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter10/argilla_dataset_with_spans.png" alt="Screenshot of the dataset in Argilla with spans annotated."/>
+
+As you move through the different records, there are different actions you can take:
+
+- submit your responses, once you're done with the record.
+- save them as a draft, in case you want to come back to them later.
+- discard them, if the record shouldn't be part of the dataset or you won't give responses to it.
+
+In the next section, you will learn how you can export and use those annotations.
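
The minimum-submitted-responses setting described in the "Distribute the task" section of the diff above can also be configured when creating a dataset from the SDK. A sketch, assuming the Argilla v2 SDK's `rg.TaskDistribution` and its `min_submitted` parameter are as named here; the dataset name and label set are illustrative:

```python
import argilla as rg

client = rg.Argilla(api_url="...", api_key="...")

settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(
            name="label",
            title="Classify the text:",
            labels=["World", "Sports", "Business", "Sci/Tech"],  # ag_news topics
        ),
    ],
    # Require two submitted responses before a record counts as complete,
    # e.g. to measure inter-annotator agreement; keep this <= team size.
    distribution=rg.TaskDistribution(min_submitted=2),
)

dataset = rg.Dataset(name="ag_news_team", settings=settings)  # illustrative name
dataset.create()
```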
