This repository presents a cloud-native, production-oriented data pipeline for the end-to-end processing, transformation, and visualization of retail transaction data. The architecture automates the full workflow and prioritizes modularity, scalability, maintainability, and secure data handling. It is built on a suite of industry-standard technologies: Terraform, Docker, Apache Airflow, PySpark, Google Cloud Storage (GCS), BigQuery, and Looker Studio.
The pipeline orchestrates every stage of the data lifecycle, from raw ingestion to transformation and final business insight generation. It simulates the operational role of a cloud data engineer and provides a scalable foundation for batch processing with extensibility for real-time streaming use cases. The approach adheres to infrastructure-as-code principles and incorporates best practices in workflow orchestration, distributed processing, and cloud-native analytics.
```mermaid
graph TD
    A[Terraform: GCP Infrastructure Setup] --> B[GCS Bucket + BigQuery Dataset]
    B --> C[Upload Raw CSV via Airflow DAG 1]
    C --> D[Raw Data Stored in GCS Bucket]
    D --> E[Raw Data Ingested to BigQuery - Airflow DAG 2]
    D --> F[Jupyter Notebook for PySpark Transformation]
    F --> G[Cleaned CSV Output Locally]
    G --> H[Upload Cleaned CSV to GCS - Airflow DAG 3]
    H --> I[Load Cleaned Data to BigQuery - Airflow DAG 4]
    I --> J[Dashboard Construction with Looker Studio]
```
```
Retail-Data-Pipeline/
├── Infrastructure/      # Terraform modules for GCP resource provisioning
├── Airflow/             # Workflow DAGs, data, and credentials
│   ├── dags/            # DAGs for raw and cleaned data ETL flows
│   ├── data/            # Local staging for CSV files
│   └── keys/            # GCP service account credentials
├── Notebooks/           # PySpark transformation notebooks
├── Dashboard/           # Looker Studio PDF snapshot
├── Scripts/             # Custom helper scripts (optional)
├── requirements.txt     # Python package dependencies
├── Makefile             # Optional automation tasks
└── README.md
```
| Tool/Service | Role & Purpose |
|---|---|
| Terraform | Infrastructure as Code for provisioning GCP resources |
| GCP: GCS & BigQuery | Cloud storage for raw/processed data and cloud data warehouse |
| Apache Airflow | DAG-based orchestration and scheduling of ETL workflows |
| PySpark | Distributed data processing framework for transformation |
| Jupyter Notebook | Interactive data exploration and PySpark development |
| Looker Studio | Cloud-native BI platform for dashboarding and analytics |
| Docker | Local development and orchestration of the Airflow environment |
- Provisioned key Google Cloud resources using declarative, version-controlled Terraform scripts.
- Deployed a GCS bucket for data lake storage and a BigQuery dataset for analytical querying.
- Ensured reproducibility and modularity through isolated resource modules and parameterization.
- The foundation supports future CI/CD automation and secure secret management.
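As a quick post-`terraform apply` sanity check, a short Python snippet like the one below can confirm that the provisioned resources are reachable with the service-account credentials stored under `Airflow/keys/`. This is a sketch only; the project ID, bucket, and dataset names are placeholders, not values from this repository.

```python
# Sanity-check the Terraform-provisioned GCS bucket and BigQuery dataset.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at the service-account key.
# All identifiers below are placeholders.
from google.cloud import bigquery, storage

PROJECT_ID = "my-gcp-project"      # placeholder
BUCKET_NAME = "retail-data-lake"   # placeholder
DATASET_ID = "retail_dataset"      # placeholder


def main() -> None:
    storage_client = storage.Client(project=PROJECT_ID)
    bq_client = bigquery.Client(project=PROJECT_ID)

    # Bucket.exists() returns False (rather than raising) when the bucket is absent.
    print("bucket exists:", storage_client.bucket(BUCKET_NAME).exists())

    # get_dataset() raises google.api_core.exceptions.NotFound if the dataset is missing.
    bq_client.get_dataset(f"{PROJECT_ID}.{DATASET_ID}")
    print("dataset found:", DATASET_ID)


if __name__ == "__main__":
    main()
```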
- Built Airflow DAG 1 to upload retail CSV data into GCS.
- Implemented Airflow DAG 2 to transfer GCS data into a BigQuery staging table.
- Included retry logic, parameterization, logging, and schema enforcement.
- A modular DAG structure promotes reusability and supports multi-environment deployment.
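A condensed sketch of DAG 1 is shown below, assuming a recent Airflow 2.x release with the `apache-airflow-providers-google` package installed; the DAG ID, paths, bucket, and connection ID are placeholders rather than values from this repository. DAG 2 follows the same pattern with `GCSToBigQueryOperator`, which is illustrated later alongside DAG 4.

```python
# Sketch of DAG 1: upload the raw retail CSV from local staging into GCS.
# Placeholder names throughout; default_args mirror the project's retry logic.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="raw_csv_to_gcs",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # triggered on demand
    catchup=False,
    default_args=default_args,
) as dag:
    upload_raw_csv = LocalFilesystemToGCSOperator(
        task_id="upload_raw_csv",
        src="/opt/airflow/data/retail_raw.csv",   # placeholder local path
        dst="raw/retail_raw.csv",                 # object key inside the bucket
        bucket="retail-data-lake",                # placeholder bucket
        gcp_conn_id="google_cloud_default",
    )
```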
- Utilized PySpark within a Jupyter notebook for scalable local data transformation.
- Applied a multi-step ETL procedure including:
- Schema casting for numeric, categorical, and temporal fields
- Removal of Personally Identifiable Information (PII)
- Handling nulls via imputation and filtering
- Normalization of categorical values
- Business logic-based filtering (e.g., remove invalid ratings/age)
- Column name standardization (snake_case)
- Output saved locally as a cleaned CSV file for subsequent loading.
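The notebook's cleaning logic can be condensed into a sketch like the following; the column names, PII fields, and age/rating thresholds are illustrative placeholders rather than the exact values used in the notebook.

```python
# Sketch of the PySpark cleaning steps (illustrative column names and thresholds).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-cleaning").getOrCreate()

df = spark.read.csv("data/retail_raw.csv", header=True, inferSchema=True)

# Column name standardization (snake_case).
df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])

# Schema casting for numeric, categorical, and temporal fields.
df = (
    df.withColumn("age", F.col("age").cast("int"))
      .withColumn("total_amount", F.col("total_amount").cast("double"))
      .withColumn("date", F.to_date("date", "yyyy-MM-dd"))
)

# PII removal (placeholder column names).
df = df.drop("customer_name", "email", "phone")

# Null handling: drop rows missing key fields, impute remaining categoricals.
df = df.na.drop(subset=["transaction_id", "total_amount", "ratings"])
df = df.na.fill({"product_category": "unknown"})

# Normalization of categorical values.
df = df.withColumn("product_category", F.lower(F.trim(F.col("product_category"))))

# Business-rule filtering (example thresholds for valid ratings and age).
df = df.filter(F.col("ratings").between(1, 5) & F.col("age").between(18, 100))

# Write a single cleaned CSV part file for the downstream Airflow DAGs.
df.coalesce(1).write.mode("overwrite").csv("data/retail_clean", header=True)
```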
- Airflow DAG 3 re-uploaded the transformed dataset to GCS.
- Airflow DAG 4 loaded the cleaned data into a production-grade BigQuery table.
- Write operations followed `WRITE_TRUNCATE` semantics to preserve consistency and reproducibility (see the sketch below).
- Raw and clean pipelines are kept clearly separated for auditability and traceability.
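A minimal sketch of the DAG 4 load step is shown below; DAG 2's staging load looks the same apart from the source object and target table. The bucket, object path, and table identifiers are placeholders.

```python
# Sketch of DAG 4: load the cleaned CSV from GCS into the production BigQuery
# table with WRITE_TRUNCATE, so every run replaces the table contents.
# All identifiers are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="clean_gcs_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    load_clean_to_bq = GCSToBigQueryOperator(
        task_id="load_clean_to_bq",
        bucket="retail-data-lake",                  # placeholder bucket
        source_objects=["clean/retail_clean.csv"],  # placeholder object path
        destination_project_dataset_table="my-gcp-project.retail_dataset.retail_clean",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,        # or schema_fields=[...] for explicit schema enforcement
        write_disposition="WRITE_TRUNCATE",
    )
```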
- Connected Looker Studio to the clean BigQuery dataset.
- Designed an interactive dashboard comprising:
- Aggregated insights on regional, temporal, and demographic breakdowns
- Transaction and revenue trends
- Product category sales
- Customer feedback and satisfaction metrics
- Enhanced with filtering, drill-downs, and responsive components.
- Dashboard link (public): Retail Looker Dashboard
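The dashboard tiles are backed by simple aggregations over the clean table. The snippet below shows one such query run through the BigQuery Python client purely for illustration; Looker Studio issues equivalent queries directly, and the project, dataset, table, and column names are placeholders.

```python
# Example aggregation of transaction counts and revenue by product category,
# the kind of query that backs a dashboard tile. Placeholder identifiers.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

QUERY = """
SELECT
  product_category,
  COUNT(*)          AS transactions,
  SUM(total_amount) AS revenue
FROM `my-gcp-project.retail_dataset.retail_clean`
GROUP BY product_category
ORDER BY revenue DESC
"""

for row in client.query(QUERY).result():
    print(row.product_category, row.transactions, row.revenue)
```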
This project demonstrates a production-grade implementation of cloud data engineering principles and toolsets. It showcases:
- A complete, automated ETL pipeline built on open-source and cloud-native infrastructure
- Integration of IaC, workflow orchestration, distributed data processing, and cloud BI
- Proper separation between raw and transformed data workflows for lineage and compliance
- Real-world experience in technologies used by leading tech companies
- Readiness for professional deployment and CI/CD integration
Distributed under the MIT License.
Developed by Sina Tavakoli