perf: add some benchmark for lance #4874
Conversation
Greptile Summary
This PR introduces a new benchmarking suite for Daft's Lance data source integration. The changes add a complete benchmarking framework under benchmarking/lance/ that compares Daft's performance against DuckDB and Lance native operations when working with Lance datasets.
The implementation follows the existing benchmarking patterns in the repository (similar to the parquet benchmarks). The suite includes:
- Data Generation: generate.py creates test datasets by downloading ImageNet JPEG images from S3, decoding them with Daft's image processing capabilities, and writing them to Lance format
- Performance Tests: test_query.py implements pytest-benchmark tests comparing count and aggregation operations across different frameworks
- Dependencies: benchmark-requirements.txt pins specific versions of testing tools (pytest-benchmark, pytest-memray) and competing frameworks (DuckDB, Datafusion)
- Documentation: README.md provides setup instructions and workflow guidance
- Package Structure: __init__.py makes the directory a proper Python package
The benchmarking methodology focuses on common data operations: counting, sum aggregations on primitive types, and sum operations on struct fields. This fits into Daft's broader performance optimization efforts by providing quantitative comparisons against established data processing frameworks when working with Lance's columnar storage format.
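The three operations above can be sketched, framework-agnostically, as a table of named test cases applied to the same records — the same shape the suite's pytest parametrization takes. This is a minimal stdlib-only illustration; the record layout, field names, and `run_case` helper are hypothetical, not the PR's actual code.

```python
# Hypothetical records standing in for rows of a Lance dataset: a primitive
# column ("size") and a struct column ("meta") with nested fields.
rows = [
    {"id": i, "size": i * 10, "meta": {"width": 64, "height": 48}}
    for i in range(1, 6)
]

# (name, aggregation) pairs, mirroring the count / sum-on-primitive /
# sum-on-struct-field cases the benchmark suite parametrizes over.
TEST_CASES = [
    ("count", lambda rs: len(rs)),
    ("sum_primitive", lambda rs: sum(r["size"] for r in rs)),
    ("sum_struct", lambda rs: sum(r["meta"]["width"] for r in rs)),
]

def run_case(agg_name):
    # Look up the aggregation by name, like pytest.mark.parametrize does.
    for name, agg in TEST_CASES:
        if name == agg_name:
            return agg(rows)
    raise ValueError(f"unknown test case: {agg_name}")
```

In the real suite each case body would be a Daft, DuckDB, or native Lance query instead of a lambda over Python dicts.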
PR Description Notes:
- The PR description is incomplete - it lacks details about what changes were made and why
- All checklist items remain unchecked
- No related issues are referenced
Confidence score: 2/5
- This PR has significant security and reliability issues that need addressing before merge
- Hardcoded placeholder credentials in generate.py pose security risks, and there are several code quality issues including unreachable code paths and missing error handling
- Files needing attention: generate.py (security credentials), test_query.py (unreachable code, hardcoded paths), README.md (missing diagram, placeholder credentials in docs)
5 files reviewed, 5 comments
    else:
        return df.select("image").collect()
logic: This else branch is unreachable since TEST_CASES only contains the three specific cases. Consider removing or adding test cases that would trigger this path.
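One way to act on this comment (a hypothetical sketch, not the PR's code): dispatch on the known case names and fail loudly on anything else, so there is no catch-all else branch that no `TEST_CASES` entry can reach.

```python
def resolve_case(case_name):
    # Map each known benchmark case to a handler; unknown names raise
    # instead of silently falling into dead code. Handlers here are
    # placeholder lambdas standing in for the real query bodies.
    handlers = {
        "count": lambda values: len(values),
        "sum_primitive": lambda values: sum(values),
        "sum_struct": lambda values: sum(values),
    }
    try:
        return handlers[case_name]
    except KeyError:
        raise ValueError(f"unsupported benchmark case: {case_name}") from None
```

An explicit `ValueError` makes a typo in a future test case an immediate failure rather than an accidentally-exercised fallback path.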
    else:
        return df["image"]
logic: Unreachable else branch - TEST_CASES doesn't contain cases that would reach this code.
    if agg_name == "sum_struct":
        return None
logic: Returning None but still running benchmark.pedantic() will benchmark None return, which may not provide meaningful results.
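If a framework cannot execute a given case, skipping the test is clearer than letting `benchmark.pedantic()` time a function that returns None; with pytest that would be a `pytest.skip(...)` call before the measurement. A stdlib-only sketch of the guard (names and exception type are hypothetical):

```python
class UnsupportedCase(Exception):
    """Raised when a framework cannot execute a benchmark case."""

def bench_target(agg_name):
    # Guard first, so the harness never measures a meaningless no-op.
    if agg_name == "sum_struct":
        raise UnsupportedCase("sum_struct is not supported by this framework")
    return f"ran {agg_name}"
```

In a pytest-benchmark test the guard would read `pytest.skip("sum_struct not supported")`, which records the case as skipped instead of producing a misleading timing for a None-returning callable.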
from daft.io import IOConfig, S3Config

DATASET_PATH = os.getenv("DATASET_PATH", "s3://xx/ai/dataset/ILSVRC2012/")
LANCE_PATH = os.getenv("LANCE_PATH", "s3://xx/ai/dataset/ILSVRC2012/")
style: LANCE_PATH variable is defined but never used in the script
)

df = daft.from_glob_path(f"{DATASET_PATH}*/*.JPEG", io_config)
df = daft.sql("SELECT * FROM df")
style: Redundant SQL statement - daft.sql("SELECT * FROM df") is unnecessary, since df already contains all the data.
agree. This is unnecessary
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #4874      +/-   ##
==========================================
- Coverage   79.78%   79.33%   -0.45%
==========================================
  Files         896      896
  Lines      125817   125817
==========================================
- Hits       100377    99814     -563
- Misses      25440    26003     +563
Force-pushed from aa1922f to f7f0bab
AWS_KEY_ID = os.getenv("AWS_KEY_ID", "xx")
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY", "xx==")
Can we use the standard AWS env vars here: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY?
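A sketch of that suggestion: read the standard AWS variable names first, falling back to the script's original names so existing setups keep working. The original names (AWS_KEY_ID / AWS_ACCESS_KEY) come from this PR; the fallback chain itself is an assumption, not the merged code.

```python
import os

def aws_credentials():
    # Prefer the standard AWS variable names the SDKs and CLI already use,
    # falling back to the script's original custom names (an assumed
    # compatibility shim, not necessarily what the PR ends up doing).
    key_id = os.getenv("AWS_ACCESS_KEY_ID") or os.getenv("AWS_KEY_ID", "")
    secret = os.getenv("AWS_SECRET_ACCESS_KEY") or os.getenv("AWS_ACCESS_KEY", "")
    return key_id, secret
```

Using the standard names also means tools like `aws configure` and CI secret stores populate the script's credentials with no extra wiring.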
@pytest.mark.parametrize("agg_name, agg_expr", TEST_CASES)
def test_duckdb_read_lance(benchmark, agg_name, agg_expr):
This DuckDB benchmark seems a bit unfair: it isn't querying Lance directly but Arrow, and the Lance-to-Arrow conversion happens outside of the benchmarked code.
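The fairness issue reduces to timing scope: a conversion step done outside the measured callable never shows up in the result. A stdlib-only sketch, where `convert` stands in for the Lance-to-Arrow conversion and the workload is fake:

```python
import time

def convert(data):
    time.sleep(0.01)           # stand-in for the lance -> arrow conversion cost
    return list(data)

def query(table):
    return sum(table)

def timed(fn):
    # Minimal stand-in for a benchmark harness measuring one call.
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

data = range(1000)

# Unfair: conversion happens before timing starts, so its cost is invisible.
table = convert(data)
_, t_outside = timed(lambda: query(table))

# Fair: conversion sits inside the timed callable and is measured.
_, t_inside = timed(lambda: query(convert(data)))
```

Moving the conversion inside the benchmarked function (or benchmarking DuckDB over the Lance dataset end to end) makes the comparison with Daft's direct Lance reads apples-to-apples.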
@Jay-ju, feel free to ping me when these comments are addressed and I can take another look!
Changes Made
Related Issues
#4763
I posted the results of a local run, which covers approximately 1000 records.
Checklist
docs/mkdocs.yml navigation