A modular Retrieval-Augmented Generation (RAG) application built with LangChain 0.3 and Python 3.13. It lets you index PDF documents and query them in natural language using OpenAI's language models.
- Modular Architecture: Factory pattern implementation for easy extensibility
- Configurable: YAML-based configuration for all components
- PDF Support: Load and process PDF documents
- Persistent Vector Store: ChromaDB for efficient document storage and retrieval
- CLI Interface: Simple command-line interface for indexing and querying
- Comprehensive Testing: Unit tests for all major components
```
rag-application/
├── config/
│   └── config.yaml              # Main configuration file
├── src/
│   ├── factories/               # Factory pattern implementations
│   │   ├── llm_factory.py
│   │   ├── embedding_factory.py
│   │   └── vectorstore_factory.py
│   ├── components/              # Core components
│   │   ├── document_loader.py
│   │   ├── text_splitter.py
│   │   └── retriever.py
│   ├── rag/
│   │   └── rag_pipeline.py      # Main RAG pipeline
│   └── utils/
│       └── config_loader.py     # Configuration loader
├── tests/                       # Unit tests
└── main.py                      # CLI entry point
```
- Python 3.13+
- OpenAI API key
- Clone the repository:
```bash
git clone <repository-url>
cd rag-application
```
- Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Set up environment variables:
```bash
cp .env.example .env
# Edit .env and add your OpenAI API key
```
- Configure the application by editing `config/config.yaml` with your preferred settings.

The application is configured via `config/config.yaml`. Key configuration options:
```yaml
llm:
  type: "openai"
  model_name: "gpt-4o-mini"
  temperature: 0.7
  max_tokens: 500

embedding:
  type: "openai"
  model_name: "text-embedding-3-small"

vectorstore:
  type: "chroma"
  persist_directory: "./data/chroma_db"
  collection_name: "rag_documents"

document_processing:
  chunk_size: 1000
  chunk_overlap: 200

retrieval:
  top_k: 4
  search_type: "similarity"
```
Index a single PDF file:
```bash
python main.py index /path/to/document.pdf
```
Index all PDFs in a directory:
```bash
python main.py index /path/to/documents/
```
Interactive mode (recommended):
```bash
python main.py query --interactive
```
Single query:
```bash
python main.py query --question "What is this document about?"
```
Show source documents:
```bash
python main.py query --interactive --show-sources
```
Use a different configuration file:
```bash
python main.py --config custom_config.yaml index document.pdf
```
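For orientation, here is a minimal sketch of how a CLI entry point like `main.py` could wire these commands with `argparse`; the real `main.py` may be organized differently, and the dispatch into the pipeline is omitted:

```python
# Hypothetical sketch of an argparse-based CLI matching the commands above.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="RAG application CLI")
    parser.add_argument("--config", default="config/config.yaml",
                        help="Path to the YAML configuration file")
    subparsers = parser.add_subparsers(dest="command", required=True)

    index_cmd = subparsers.add_parser("index", help="Index a PDF file or directory")
    index_cmd.add_argument("path", help="PDF file or directory of PDFs")

    query_cmd = subparsers.add_parser("query", help="Query the indexed documents")
    query_cmd.add_argument("--question", help="Answer a single question and exit")
    query_cmd.add_argument("--interactive", action="store_true",
                           help="Start an interactive query loop")
    query_cmd.add_argument("--show-sources", action="store_true",
                           help="Print the retrieved source chunks")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # Replace with dispatch into the indexing / querying pipeline.
```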
Run all tests:
```bash
pytest
```
Run tests with coverage:
```bash
pytest --cov=src tests/
```
Run a specific test file:
```bash
pytest tests/test_factories.py
```
The application uses the Factory Pattern for creating instances of:
- LLM Factory: Creates language model instances (OpenAI)
- Embedding Factory: Creates embedding model instances (OpenAI)
- Vector Store Factory: Creates vector store instances (Chroma)
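These factories share a common interface defined in `src/factories/base_factory.py`. A minimal sketch of what that abstract base class might look like (the exact method names are assumptions):

```python
# Hypothetical sketch of src/factories/base_factory.py.
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseFactory(ABC):
    """Common interface shared by the LLM, embedding, and vector store factories."""

    @abstractmethod
    def create(self, config: Dict[str, Any]) -> Any:
        """Create and return a configured instance for the provider named in config."""
        raise NotImplementedError
```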
To add support for new providers, simply:
- Extend the appropriate factory class
- Add configuration in `config.yaml`
- Implement the provider-specific creation method
Example:
```python
# Requires: from langchain_anthropic import ChatAnthropic
def _create_anthropic_llm(self, config: Dict[str, Any]) -> Any:
    return ChatAnthropic(
        model=config.get('model_name', 'claude-sonnet-4-20250514'),
        temperature=config.get('temperature', 0.7)
    )
```
- `document_loader.py`: Handles loading PDF documents from files or directories.
- `text_splitter.py`: Splits documents into chunks for efficient processing and retrieval.
- `retriever.py`: Retrieves relevant document chunks based on similarity search.
- `rag_pipeline.py`: Orchestrates the entire RAG workflow:
  - Document loading
  - Text splitting
  - Vector store creation/loading
  - Query processing
  - Answer generation
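As an illustration only, that workflow could look roughly like the LangChain-style sketch below. The function names, prompt, and defaults are assumptions and not the project's actual `rag_pipeline.py` API:

```python
# Illustrative end-to-end sketch: index a PDF into Chroma, then answer a question.
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def index_pdf(pdf_path: str, persist_directory: str = "./data/chroma_db") -> Chroma:
    """Load a PDF, split it into chunks, and store the chunks in Chroma."""
    documents = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(documents)
    return Chroma.from_documents(
        documents=chunks,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
        persist_directory=persist_directory,
        collection_name="rag_documents",
    )


def answer(question: str, vectorstore: Chroma) -> str:
    """Retrieve the most relevant chunks and generate an answer from them."""
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7, max_tokens=500)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content
```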
```
src/
├── factories/                   # Factory pattern implementations
│   ├── base_factory.py          # Abstract base class for factories
│   ├── llm_factory.py           # LLM instance creation
│   ├── embedding_factory.py     # Embedding model creation
│   └── vectorstore_factory.py   # Vector store creation
│
├── components/                  # Modular components
│   ├── document_loader.py       # PDF loading
│   ├── text_splitter.py         # Text chunking
│   └── retriever.py             # Document retrieval
│
├── rag/                         # Main RAG logic
│   └── rag_pipeline.py          # Pipeline orchestration
│
└── utils/                       # Utility functions
    └── config_loader.py         # YAML configuration loading
```
- Update `src/factories/llm_factory.py`:
```python
def create(self, config: Dict[str, Any]) -> Any:
    llm_type = config.get('type', '').lower()

    if llm_type == 'openai':
        return self._create_openai_llm(config)
    elif llm_type == 'anthropic':  # New provider
        return self._create_anthropic_llm(config)
    else:
        raise ValueError(f"Unsupported LLM type: {llm_type}")
```
- Update `config/config.yaml`:
```yaml
llm:
  type: "anthropic"
  model_name: "claude-sonnet-4-20250514"
```

Follow the same pattern in `src/factories/vectorstore_factory.py`.
- API Key Error: Ensure your OpenAI API key is set in `.env`
- File Not Found: Check that PDF paths are correct
- Memory Issues: Reduce `chunk_size` or `top_k` in the config
- Empty Results: Ensure documents are indexed before querying
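For the API key issue in particular, a quick check like the one below can confirm the key is actually being picked up from `.env` (this assumes the project loads it with `python-dotenv`, which is common but not confirmed here):

```python
# Quick sanity check that the OpenAI API key is visible to the application.
import os

from dotenv import load_dotenv

load_dotenv()  # Reads .env from the current working directory.
print("OPENAI_API_KEY is set" if os.getenv("OPENAI_API_KEY") else "OPENAI_API_KEY is missing")
```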
- Chunk Size: Larger chunks (1000-2000) for comprehensive context, smaller (500-1000) for precise retrieval
- Overlap: 10-20% of chunk size for better context continuity
- Top K: 3-5 documents for most queries, increase for complex questions
- Temperature: Lower (0.3-0.5) for factual answers, higher (0.7-0.9) for creative responses
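To see how different `chunk_size` and `chunk_overlap` values play out before re-indexing a whole corpus, a small experiment like this can help (sketch only; it assumes a `RecursiveCharacterTextSplitter`, which the config values suggest but the repository does not confirm):

```python
# Compare how different chunk sizes split the same document (sketch).
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

documents = PyPDFLoader("example.pdf").load()  # replace with a real PDF path

for chunk_size in (500, 1000, 2000):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.15),  # roughly 15% overlap
    )
    chunks = splitter.split_documents(documents)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")
```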
This project is released under the MIT License.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
For issues and questions, please open an issue on GitHub.