
Imagine a scenario of daily (or weekly) sports betting where you're on a quest to outsmart the bookies. This project is a data warehouse simulation built on the European Soccer Database. Using team and player statistics, performance metrics, FIFA stats, and bookie odds, we hunt down opportunities and place value bets.
Within the pipeline, you can:
- Version Your Dataset: run preprocessing to (re)generate your ML dataset
- Experiment & Store: run and save ML experiments
- Model Management: save and compare models
- Reproducibility: ensure inference pipelines run without train/serving skew (run simulations)
- Feature Store: house all input features with only the KPIs available at that point in time (see the sketch after this list)
- Prediction Audit: maintain a log of all predictions
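The feature-store and reproducibility points boil down to point-in-time correctness: a match may only use KPIs that were already known on match day. As a rough illustration (not project code; frame and column names are made up), this is the kind of as-of join involved:

```python
import pandas as pd

# Hypothetical frames: upcoming fixtures and a running log of team KPIs.
fixtures = pd.DataFrame({
    "match_date": pd.to_datetime(["2015-08-01", "2015-08-08"]),
    "team_id": [10, 10],
})
team_kpis = pd.DataFrame({
    "kpi_date": pd.to_datetime(["2015-07-15", "2015-08-05"]),
    "team_id": [10, 10],
    "rolling_goal_diff": [0.4, 0.9],
})

# Point-in-time join: each fixture only sees KPIs published on or before its
# match_date, which is what keeps training and serving features consistent.
features = pd.merge_asof(
    fixtures.sort_values("match_date"),
    team_kpis.sort_values("kpi_date"),
    left_on="match_date",
    right_on="kpi_date",
    by="team_id",
    direction="backward",
)
print(features)
```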
Requirements:
- Python (>=3.11, tested up to 3.12)
- Access to a Databricks cluster (e.g., Azure free account)
Manual setup:
- Install a virtual environment and the requirements:
  ```sh
  virtualenv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  # optional, for contribution
  pip install -r requirements-dev.txt
  ```
- Download the data from here (you need a Kaggle account) and drop the resulting `database.sqlite` file in the data folder.
- Convert the data to parquet and csv files (a sketch of the conversion follows the setup list):
  ```sh
  python scripts/convert_data.py
  ```
- Databricks:
  - Create a SQL warehouse
  - Create a personal access token
  - Upload the data (parquet files) to your schema of choice
  - Create a compute cluster
- Set up environment/secrets: fill in the template in `env.templ` (rename it to `.env`) and set the env vars (e.g. `set -a && source .env`)
- dbt setup:
  - Initialise and install dependencies:
    ```sh
    dbt deps
    ```
  - Set up your dbt profile; normally the env vars have already configured it, so nothing is needed here (`dbt_your_best_bet/profiles/profiles.yml`)
- Install the `riskrover` Python package (managed with Poetry) on your compute:
  - Build the package:
    ```sh
    cd riskrover && poetry build
    ```
  - Install the resulting whl file (`riskrover/dist/riskrover-x.y.z-py3-none-any.whl`) on your Databricks compute cluster (Compute-scoped libraries)
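For orientation, the conversion step above amounts to dumping every table in `database.sqlite` to parquet and csv. A minimal sketch of that idea (the actual logic lives in `scripts/convert_data.py` and may differ):

```python
import sqlite3
from pathlib import Path

import pandas as pd  # to_parquet requires pyarrow (or fastparquet)

DATA_DIR = Path("data")

with sqlite3.connect(DATA_DIR / "database.sqlite") as conn:
    # List the user tables, then dump each one to parquet and csv.
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table'", conn
    )["name"]
    for table in tables:
        df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
        df.to_parquet(DATA_DIR / f"{table}.parquet", index=False)
        df.to_csv(DATA_DIR / f"{table}.csv", index=False)
```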
You should now be able to run the pipeline without any trained models, e.g. just the preprocessing:
```sh
dbt build --selector gold
```
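If that build fails to connect, a quick sanity check of your warehouse credentials from Python can help. This is a sketch using the `databricks-sql-connector` package; the env var names below are placeholders, so substitute whatever you defined in `env.templ` / `.env`:

```python
import os

from databricks import sql  # pip install databricks-sql-connector

# Placeholder env var names -- align them with your .env file.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())  # any row back means the warehouse is reachable
```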
The setup described above is manual and intended for demonstration purposes. For production deployments, consider the following best practices:
- Infrastructure as Code: Use tools like Terraform to provision Databricks clusters, manage accounts, networking, and other resources.
- Containerization & Orchestration: Containerize your dbt environment (e.g., with Docker) and orchestrate workflows using tools like Apache Airflow.
- Packaging: Package and publish `riskrover` to a private package repository and install it on the cluster dynamically within your CI/CD pipeline.
- Secrets & Environment Management: Manage secrets and environment variables securely using services such as Databricks Secrets or Azure Key Vault.
- CI/CD: Implement continuous integration and deployment pipelines for automated testing and deployment, for example with GitHub Actions.
Explore and see how we could've made a profit back in 2016 if we'd had access to this data at the right time :D.
The default variables are stored in `dbt_project.yml`. We find ourselves on 2016-01-01 in our simulation, with the option to run until 2016-05-25.
```sh
cd dbt_your_best_bet

# Preprocessing
dbt build --selector gold

# Experimentation (by default -> training set up to 2015-07-31, trains a simple logistic regression with cross-validation)
dbt build --selector ml_experiment

# Inference on the test set (2015-08-01 -> 2015-12-31)
dbt build --selector ml_predict_run

# Moving forward in time, for example with a weekly run
dbt build --vars '{"run_date": "2016-01-08"}'
dbt build --vars '{"run_date": "2016-01-15"}'
dbt build --vars '{"run_date": "2016-01-22"}'
# ...

# Check if you made any money by compiling and running an analysis file
dbt compile -s analyses/compare_model_profit.sql
```
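The experimentation step above is described as a simple logistic regression with cross-validation; stripped of the dbt plumbing, the equivalent scikit-learn logic looks roughly like this (paths, feature names, and the target column are placeholders, not the real schema):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training extract up to the training cutoff (2015-07-31).
train = pd.read_parquet("data/ml_dataset.parquet")
X = train[["home_form", "away_form", "fifa_rating_diff"]]  # placeholder features
y = train["match_outcome"]                                 # e.g. home_win / draw / away_win

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validated score on the training window, then a final fit for persistence.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
print(f"CV log loss: {-scores.mean():.3f}")
model.fit(X, y)
```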
Analysis is available in `riskrover/notebooks/riskrover.ipynb`.
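For a feel of what that profit comparison involves, the value-betting logic amounts to staking only where the model's probability times the bookie odds exceeds 1. A toy illustration with made-up numbers (the real comparison is in `analyses/compare_model_profit.sql` and the notebook):

```python
import pandas as pd

# Toy predictions: model probability, bookie decimal odds, and the actual outcome.
preds = pd.DataFrame({
    "p_home_win": [0.55, 0.30, 0.62],
    "odds_home_win": [2.10, 3.80, 1.55],
    "home_win": [1, 0, 1],
})

# Value bet: expected value of a unit stake is p * odds - 1; bet only when it is positive.
preds["expected_value"] = preds["p_home_win"] * preds["odds_home_win"] - 1
bets = preds[preds["expected_value"] > 0]

# Unit-stake profit: a win pays (odds - 1), a loss costs the stake.
profit = (bets["home_win"] * (bets["odds_home_win"] - 1) - (1 - bets["home_win"])).sum()
print(f"Placed {len(bets)} bets for a profit of {profit:.2f} units")
```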
Generate and serve the dbt docs:
```sh
dbt docs generate
dbt docs serve
```
### Cleanup
```sql
drop table if exists snapshots.predict_input_history;
drop table if exists snapshots.experiment_history;
```
Then rebuild with `--full-refresh`.
Mostly maintenance; no plans for new features unless requested.
- Extra documentation
- Data tests and unit tests
- Extra SQL analysis
Distributed under the MIT License.