perf: add some benchmark for lance #4874
Conversation
Greptile Summary
This PR introduces a new benchmarking suite for Daft's Lance data source integration. The changes add a complete benchmarking framework under benchmarking/lance/ that compares Daft's performance against DuckDB and Lance native operations when working with Lance datasets.
The implementation follows the existing benchmarking patterns in the repository (similar to the parquet benchmarks). The suite includes:
- Data Generation: generate.py creates test datasets by downloading ImageNet JPEG images from S3, decoding them with Daft's image processing capabilities, and writing them to Lance format
- Performance Tests: test_query.py implements pytest-benchmark tests comparing count and aggregation operations across different frameworks
- Dependencies: benchmark-requirements.txt pins specific versions of testing tools (pytest-benchmark, pytest-memray) and competing frameworks (DuckDB, Datafusion)
- Documentation: README.md provides setup instructions and workflow guidance
- Package Structure: __init__.py makes the directory a proper Python package
The benchmarking methodology focuses on common data operations: counting, sum aggregations on primitive types, and sum operations on struct fields. This fits into Daft's broader performance optimization efforts by providing quantitative comparisons against established data processing frameworks when working with Lance's columnar storage format.
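The three operations above can be sketched, framework-agnostically, as a table of named test cases applied to the same records — the same shape the suite's pytest parametrization takes. This is a minimal stdlib-only illustration; the record layout, field names, and `run_case` helper are hypothetical, not the PR's actual code.

```python
# Hypothetical records standing in for rows of a Lance dataset: a primitive
# column ("size") and a struct column ("meta") with nested fields.
rows = [
    {"id": i, "size": i * 10, "meta": {"width": 64, "height": 48}}
    for i in range(1, 6)
]

# (name, aggregation) pairs, mirroring the count / sum-on-primitive /
# sum-on-struct-field cases the benchmark suite parametrizes over.
TEST_CASES = [
    ("count", lambda rs: len(rs)),
    ("sum_primitive", lambda rs: sum(r["size"] for r in rs)),
    ("sum_struct", lambda rs: sum(r["meta"]["width"] for r in rs)),
]

def run_case(agg_name):
    # Look up the aggregation by name, like pytest.mark.parametrize does.
    for name, agg in TEST_CASES:
        if name == agg_name:
            return agg(rows)
    raise ValueError(f"unknown test case: {agg_name}")
```

In the real suite each case body would be a Daft, DuckDB, or native Lance query instead of a lambda over Python dicts.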
PR Description Notes:
- The PR description is incomplete - it lacks details about what changes were made and why
- All checklist items remain unchecked
- No related issues are referenced
Confidence score: 2/5
- This PR has significant security and reliability issues that need addressing before merge
- Hardcoded placeholder credentials in generate.py pose security risks, and there are several code quality issues including unreachable code paths and missing error handling
- Files needing attention: generate.py (security credentials), test_query.py (unreachable code, hardcoded paths), README.md (missing diagram, placeholder credentials in docs)
5 files reviewed, 5 comments
    else:
        return df.select("image").collect()
logic: This else branch is unreachable since TEST_CASES only contains the three specific cases. Consider removing or adding test cases that would trigger this path.
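One way to act on this comment (a hypothetical sketch, not the PR's code): dispatch on the known case names and fail loudly on anything else, so there is no catch-all else branch that no `TEST_CASES` entry can reach.

```python
def resolve_case(case_name):
    # Map each known benchmark case to a handler; unknown names raise
    # instead of silently falling into dead code. Handlers here are
    # placeholder lambdas standing in for the real query bodies.
    handlers = {
        "count": lambda values: len(values),
        "sum_primitive": lambda values: sum(values),
        "sum_struct": lambda values: sum(values),
    }
    try:
        return handlers[case_name]
    except KeyError:
        raise ValueError(f"unsupported benchmark case: {case_name}") from None
```

An explicit `ValueError` makes a typo in a future test case an immediate failure rather than an accidentally-exercised fallback path.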
    else:
        return df["image"]
logic: Unreachable else branch - TEST_CASES doesn't contain cases that would reach this code.
    if agg_name == "sum_struct":
        return None
logic: Returning None but still running benchmark.pedantic() will benchmark None return, which may not provide meaningful results.
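If a framework cannot execute a given case, skipping the test is clearer than letting `benchmark.pedantic()` time a function that returns None; with pytest that would be a `pytest.skip(...)` call before the measurement. A stdlib-only sketch of the guard (names and exception type are hypothetical):

```python
class UnsupportedCase(Exception):
    """Raised when a framework cannot execute a benchmark case."""

def bench_target(agg_name):
    # Guard first, so the harness never measures a meaningless no-op.
    if agg_name == "sum_struct":
        raise UnsupportedCase("sum_struct is not supported by this framework")
    return f"ran {agg_name}"
```

In a pytest-benchmark test the guard would read `pytest.skip("sum_struct not supported")`, which records the case as skipped instead of producing a misleading timing for a None-returning callable.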
from daft.io import IOConfig, S3Config

DATASET_PATH = os.getenv("DATASET_PATH", "s3://xx/ai/dataset/ILSVRC2012/")
LANCE_PATH = os.getenv("LANCE_PATH", "s3://xx/ai/dataset/ILSVRC2012/")
style: LANCE_PATH variable is defined but never used in the script
)

df = daft.from_glob_path(f"{DATASET_PATH}*/*.JPEG", io_config)
df = daft.sql("SELECT * FROM df")
style: Redundant SQL statement - daft.sql("SELECT * FROM df") is unnecessary, since df already contains all the data.
agree. This is unnecessary
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #4874      +/-   ##
==========================================
- Coverage   79.78%   79.33%   -0.45%
==========================================
  Files         896      896
  Lines      125817   125817
==========================================
- Hits       100377    99814     -563
- Misses      25440    26003     +563
Force-pushed from aa1922f to f7f0bab
AWS_KEY_ID = os.getenv("AWS_KEY_ID", "xx")
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY", "xx==")
Can we use the standard AWS env vars here: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY?
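A sketch of that suggestion: read the standard AWS variable names first, falling back to the script's original names so existing setups keep working. The original names (AWS_KEY_ID / AWS_ACCESS_KEY) come from this PR; the fallback chain itself is an assumption, not the merged code.

```python
import os

def aws_credentials():
    # Prefer the standard AWS variable names the SDKs and CLI already use,
    # falling back to the script's original custom names (an assumed
    # compatibility shim, not necessarily what the PR ends up doing).
    key_id = os.getenv("AWS_ACCESS_KEY_ID") or os.getenv("AWS_KEY_ID", "")
    secret = os.getenv("AWS_SECRET_ACCESS_KEY") or os.getenv("AWS_ACCESS_KEY", "")
    return key_id, secret
```

Using the standard names also means tools like `aws configure` and CI secret stores populate the script's credentials with no extra wiring.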
@pytest.mark.parametrize("agg_name, agg_expr", TEST_CASES)
def test_duckdb_read_lance(benchmark, agg_name, agg_expr):
This DuckDB benchmark seems a bit unfair: it isn't querying Lance directly but Arrow, and the Lance-to-Arrow conversion happens outside of the benchmarked code.
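The fairness issue reduces to timing scope: a conversion step done outside the measured callable never shows up in the result. A stdlib-only sketch, where `convert` stands in for the Lance-to-Arrow conversion and the workload is fake:

```python
import time

def convert(data):
    time.sleep(0.01)           # stand-in for the lance -> arrow conversion cost
    return list(data)

def query(table):
    return sum(table)

def timed(fn):
    # Minimal stand-in for a benchmark harness measuring one call.
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

data = range(1000)

# Unfair: conversion happens before timing starts, so its cost is invisible.
table = convert(data)
_, t_outside = timed(lambda: query(table))

# Fair: conversion sits inside the timed callable and is measured.
_, t_inside = timed(lambda: query(convert(data)))
```

Moving the conversion inside the benchmarked function (or benchmarking DuckDB over the Lance dataset end to end) makes the comparison with Daft's direct Lance reads apples-to-apples.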
@Jay-ju, feel free to ping me when these comments are addressed and I can take another look!
Changes Made
Related Issues
#4763
I posted the results of a local run, which covers approximately 1000 records.
Checklist
docs/mkdocs.yml navigation