Skip to content

perf: add some benchmark for lance #4874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Jay-ju
Copy link
Contributor

@Jay-ju Jay-ju commented Jul 30, 2025

Changes Made

Related Issues

#4763
I brought up the results of running locally, which are approximately 1000 records.

---------------------------------------------------------------------------------------------------------------------------------- benchmark: 9 tests ---------------------------------------------------------------------------------------------------------------------------------
Name (time in ns)                                                     Min                         Max                        Mean                      StdDev                      Median                         IQR            Outliers             OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_duckdb_read_lance[sum_struct]                               149.0116 (1.0)            1,065.8987 (1.0)              357.9073 (1.0)              399.0490 (1.0)              154.1339 (1.0)              317.5810 (1.0)           1;1  2,794,019.8387 (1.0)           5           1
test_lance_native_read[count-COUNT(1)]                     2,408,016.9387 (>1000.0)    4,414,444.7893 (>1000.0)    2,919,809.4422 (>1000.0)      403,644.6835 (>1000.0)    2,797,108.6092 (>1000.0)      318,493.1120 (>1000.0)     40;21        342.4881 (0.00)        247           1
test_duckdb_read_lance[sum_int]                            2,776,946.8725 (>1000.0)    3,177,690.3197 (>1000.0)    3,046,105.6158 (>1000.0)      164,196.4137 (411.47)     3,100,662.0266 (>1000.0)      214,710.2496 (676.08)        1;0        328.2880 (0.00)          5           1
test_duckdb_read_lance[count]                              2,948,476.0016 (>1000.0)    4,813,462.9615 (>1000.0)    3,923,549.8756 (>1000.0)      710,553.0621 (>1000.0)    3,832,014.2776 (>1000.0)      999,579.2061 (>1000.0)       2;0        254.8712 (0.00)          5           1
test_lance_native_read[sum_int-SUM(size)]                 17,664,573.1553 (>1000.0)   28,122,316.1146 (>1000.0)   21,726,866.9396 (>1000.0)    2,948,127.6718 (>1000.0)   21,872,717.1887 (>1000.0)    5,108,494.3116 (>1000.0)      16;0         46.0260 (0.00)         40           1
test_daft_read_lance[count-COUNT(1)]                      36,386,939.6970 (>1000.0)   38,619,516.8830 (>1000.0)   37,631,585.7284 (>1000.0)      914,912.5088 (>1000.0)   37,788,018.1186 (>1000.0)    1,503,030.3039 (>1000.0)       2;0         26.5734 (0.00)          5           1
test_daft_read_lance[sum_int-SUM(size)]                   39,596,260.9909 (>1000.0)   42,440,177.8728 (>1000.0)   40,978,406.6491 (>1000.0)    1,050,849.4367 (>1000.0)   41,084,280.2376 (>1000.0)    1,330,918.7489 (>1000.0)       2;0         24.4031 (0.00)          5           1
test_lance_native_read[sum_struct-SUM(image.height)]     307,479,785.7553 (>1000.0)  387,991,133.1460 (>1000.0)  338,949,221.2795 (>1000.0)   31,716,360.5403 (>1000.0)  342,062,475.1598 (>1000.0)   41,463,551.9264 (>1000.0)       1;0          2.9503 (0.00)          5           1
test_daft_read_lance[sum_struct-SUM(image.height)]       323,219,778.9475 (>1000.0)  878,646,744.8808 (>1000.0)  697,738,404.6651 (>1000.0)  222,157,485.2833 (>1000.0)  740,964,299.9992 (>1000.0)  251,344,511.1690 (>1000.0)       1;0          1.4332 (0.00)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the perf label Jul 30, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR introduces a new benchmarking suite for Daft's Lance data source integration. The changes add a complete benchmarking framework under benchmarking/lance/ that compares Daft's performance against DuckDB and Lance native operations when working with Lance datasets.

The implementation follows the existing benchmarking patterns in the repository (similar to the parquet benchmarks). The suite includes:

  • Data Generation: generate.py creates test datasets by downloading ImageNet JPEG images from S3, decoding them with Daft's image processing capabilities, and writing them to Lance format
  • Performance Tests: test_query.py implements pytest-benchmark tests comparing count and aggregation operations across different frameworks
  • Dependencies: benchmark-requirements.txt pins specific versions of testing tools (pytest-benchmark, pytest-memray) and competing frameworks (DuckDB, Datafusion)
  • Documentation: README.md provides setup instructions and workflow guidance
  • Package Structure: __init__.py makes the directory a proper Python package

The benchmarking methodology focuses on common data operations like counting, sum aggregations on primitive types, and sum operations on struct fields. This fits into Daft's broader performance optimization efforts by providing quantitative comparisons against established data processing frameworks when working with Lance's columnar storage format.

PR Description Notes:

  • The PR description is incomplete - it lacks details about what changes were made and why
  • All checklist items remain unchecked
  • No related issues are referenced

Confidence score: 2/5

  • This PR has significant security and reliability issues that need addressing before merge
  • Hardcoded placeholder credentials in generate.py pose security risks, and there are several code quality issues including unreachable code paths and missing error handling
  • Files needing attention: generate.py (security credentials), test_query.py (unreachable code, hardcoded paths), README.md (missing diagram, placeholder credentials in docs)

5 files reviewed, 5 comments

Edit Code Review Bot Settings | Greptile

Comment on lines +46 to +47
else:
return df.select("image").collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: This else branch is unreachable since TEST_CASES only contains the three specific cases. Consider removing or adding test cases that would trigger this path.

Comment on lines +75 to +109
else:
return df["image"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Unreachable else branch - TEST_CASES doesn't contain cases that would reach this code.

Comment on lines +97 to +137
if agg_name == "sum_struct":
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Returning None but still running benchmark.pedantic() will benchmark None return, which may not provide meaningful results.

from daft.io import IOConfig, S3Config

DATASET_PATH = os.getenv("DATASET_PATH", "s3://xx/ai/dataset/ILSVRC2012/")
LANCE_PATH = os.getenv("LANCE_PATH", "s3://xx/ai/dataset/ILSVRC2012/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: LANCE_PATH variable is defined but never used in the script

)

df = daft.from_glob_path(f"{DATASET_PATH}*/*.JPEG", io_config)
df = daft.sql("SELECT * FROM df")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Redundant SQL SELECT statement - daft.sql("SELECT * FROM df") is unnecessary since df already contains all data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. This is unnecessary

Copy link

codecov bot commented Jul 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.33%. Comparing base (af3fdd9) to head (56b6b34).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4874      +/-   ##
==========================================
- Coverage   79.78%   79.33%   -0.45%     
==========================================
  Files         896      896              
  Lines      125817   125817              
==========================================
- Hits       100377    99814     -563     
- Misses      25440    26003     +563     

see 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Jay-ju Jay-ju force-pushed the lance_benchmark branch 2 times, most recently from aa1922f to f7f0bab Compare July 31, 2025 02:01
@desmondcheongzx desmondcheongzx self-requested a review August 4, 2025 16:32
Comment on lines +11 to +12
AWS_KEY_ID = os.getenv("AWS_KEY_ID", "xx")
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY", "xx==")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the standard AWS env vars here. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

)

df = daft.from_glob_path(f"{DATASET_PATH}*/*.JPEG", io_config)
df = daft.sql("SELECT * FROM df")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. This is unnecessary



@pytest.mark.parametrize("agg_name, agg_expr", TEST_CASES)
def test_duckdb_read_lance(benchmark, agg_name, agg_expr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this duckdb benchmark seems a bit unfair as it's not using lance, but arrow, and the lance to arrow conversion happens outside of the benchmark.

@universalmind303
Copy link
Contributor

@Jay-ju, feel free to ping me when these comments are addressed and I can take another look!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants