Module 1: Introduction

In this module, we will learn what LLMs and RAG are and implement a simple RAG pipeline to answer questions about the FAQ documents from our Zoomcamp courses.

What we will do:

1.1 Introduction to LLM and RAG

  • LLM
  • RAG
  • RAG architecture
  • Course outcome

1.2 Preparing the Environment

  • Installing libraries
  • Alternative: installing Anaconda or Miniconda
pip install tqdm notebook==7.1.2 openai elasticsearch==8.13.0 pandas scikit-learn ipywidgets
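
The examples in this module call the OpenAI API, so you will also need an API key available in your environment. A minimal sanity check (assuming you exported the key as OPENAI_API_KEY in your shell):

import os

# export OPENAI_API_KEY="..." in your shell before starting Jupyter;
# the openai client picks it up from the environment automatically
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"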

1.3 Retrieval

Note: as of now, you can install minsearch with pip:

pip install minsearch
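
To give an idea of how it is used, here is a minimal sketch of the retrieval step with minsearch (field names match the FAQ documents; documents is assumed to be the parsed FAQ list):

from minsearch import Index

# documents: list of dicts with "question", "text", "section" and "course" fields
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(documents)

results = index.search(
    query="The course has already started. Can I still enroll?",
    filter_dict={"course": "data-engineering-zoomcamp"},
    boost_dict={"question": 3.0, "section": 0.5},
    num_results=5
)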

1.4 Generation with OpenAI

  • Invoking OpenAI API
  • Building the prompt
  • Getting the answer
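
A minimal sketch of the generation step, reusing the results from the retrieval sketch above and assuming the openai Python client (v1+) with OPENAI_API_KEY set; the prompt wording and model name are only examples:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

query = "The course has already started. Can I still enroll?"
context = "\n\n".join(
    f"question: {d['question']}\nanswer: {d['text']}" for d in results
)

prompt = (
    "You're a course teaching assistant. "
    "Answer the QUESTION based on the CONTEXT from the FAQ database.\n\n"
    f"QUESTION: {query}\n\nCONTEXT:\n{context}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

answer = response.choices[0].message.content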

If you don't want to use a service, you can run an LLM locally; refer to module 2 for more details.

In particular, check "2.7 Ollama - Running LLMs on a CPU" - it can work with the OpenAI API, so to make the example from 1.4 work locally, you only need to change a few lines of code.
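
For example, with Ollama running locally (it exposes an OpenAI-compatible endpoint on port 11434), the only change to the code from 1.4 is how the client is created and which model is requested; the api_key value is a placeholder that Ollama ignores:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"  # any non-empty string; Ollama does not check it
)

# then request a model you have pulled locally, e.g. model="phi3"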

1.4.2 OpenAI API Alternatives

OpenAI Alternatives

1.5 Cleaned RAG flow

  • Cleaning the code we wrote so far
  • Making it modular
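
The cleaned-up flow boils down to a few small functions along these lines (a sketch reusing the index and client objects from the earlier snippets; names are illustrative):

def search(query):
    # retrieve the most relevant FAQ records for the query
    return index.search(
        query=query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict={"question": 3.0, "section": 0.5},
        num_results=5
    )

def build_prompt(query, search_results):
    context = "\n\n".join(
        f"section: {d['section']}\nquestion: {d['question']}\nanswer: {d['text']}"
        for d in search_results
    )
    return (
        "You're a course teaching assistant. "
        "Answer the QUESTION based on the CONTEXT from the FAQ database.\n\n"
        f"QUESTION: {query}\n\nCONTEXT:\n{context}"
    )

def llm(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    return llm(prompt)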

1.6 Searching with ElasticSearch

  • Run ElasticSearch with Docker
  • Index the documents
  • Replace MinSearch with ElasticSearch

Running ElasticSearch:

docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

If the previous command doesn't work (e.g. you see "error pulling image configuration"), try running ElasticSearch with the image from Docker Hub instead:

docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    elasticsearch:8.4.3
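
Once the container is up, a quick sanity check from Python (using the elasticsearch client installed in 1.2):

from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")
print(es_client.info())  # should report an 8.x cluster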

Index settings:

{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}
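
With these settings (assuming they are stored in a dict called index_settings), creating the index and loading the FAQ documents with the Python client looks roughly like this; the index name is only an example:

index_name = "course-questions"

es_client.indices.create(
    index=index_name,
    settings=index_settings["settings"],
    mappings=index_settings["mappings"]
)

# documents is the same parsed FAQ list used for minsearch
for doc in documents:
    es_client.index(index=index_name, document=doc)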

Query:

{
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

We use "type": "best_fields". You can read more about different types of multi_match search in elastic-search.md.

1.7 Homework

More information here.

Extra materials

  • If you're curious to know how the code for parsing the FAQ works, check this video

Open-Source LLMs (optional)

It's also possible to run LLMs locally. For that, we can use Ollama. Check the videos from the LLM Zoomcamp 2024 cohort if you're interested in learning more about it.

To see the command lines used in those videos, see the 2024 cohort folder.

Notes