feat: Add offset support to Spark Connect #4962
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #4962    +/-  ##
=======================================
+ Coverage   76.29%   76.86%   +0.56%
=======================================
  Files         918      918
  Lines      128703   126910   -1793
=======================================
- Hits        98195    97545    -650
+ Misses      30508    29365   -1143
```
Force-pushed from de86fc2 to 1e5e079 (compare)
Greptile Summary
This PR adds comprehensive OFFSET support to Daft's Spark Connect implementation, enabling pagination functionality that was previously missing. The implementation consists of three key components:

- **Core Rust Implementation**: Added offset functionality to `spark_analyzer.rs` by implementing an `offset` method that validates input parameters (ensuring non-negative values), converts them to `u64`, and integrates with the logical plan builder. The implementation follows the same pattern as the existing `limit` method for consistency.
- **SQL Query Support**: Added test coverage in `test_spark_sql.py` that validates both the `OFFSET ... LIMIT` and `LIMIT ... OFFSET` syntax variations work correctly with SQL queries. The tests use a large dataset (1024 rows) with shuffled data to ensure robust validation under realistic conditions.
- **DataFrame API Support**: Added programmatic offset testing in `test_basics.py` through the `test_range_limit_offset` function, which validates that `df.offset()` works correctly both independently and in combination with `limit()` operations.
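The validation behavior described for the Rust `offset` method can be sketched in plain Python. This is an illustrative sketch only, with hypothetical names (`apply_offset`); the actual implementation in `spark_analyzer.rs` builds a logical plan node rather than slicing rows:

```python
def apply_offset(plan_rows, offset):
    """Sketch of the offset validation described above.

    `plan_rows` stands in for the rows produced by the logical plan;
    the real Rust code composes a plan operator instead of slicing lists.
    """
    if offset < 0:
        # Reject negative values before treating the offset as unsigned,
        # mirroring the non-negative check prior to the u64 conversion.
        raise ValueError(f"OFFSET must be non-negative, got {offset}")
    return plan_rows[offset:]
```

As with the `limit` pattern the summary mentions, the check happens before the value is interpreted as unsigned, so a negative offset fails loudly instead of silently wrapping.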
The implementation addresses GitHub issues #3581 and #4547, providing essential pagination capabilities for Spark Connect users. The changes integrate seamlessly with Daft's existing logical plan architecture and maintain consistency with Spark's DataFrame API. All tests use shuffled data followed by ordering to ensure the offset mechanism works correctly regardless of initial data arrangement.
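The test strategy described above (shuffle the data, order it, then paginate) can be emulated with plain Python lists to show why ordering is required for deterministic pagination. This is a stand-in for the semantics, not the actual test code, and `paginate` is a hypothetical helper:

```python
import random

def paginate(rows, offset, limit):
    # LIMIT ... OFFSET semantics: skip `offset` rows, keep at most `limit` rows.
    return rows[offset:offset + limit]

data = list(range(1024))   # same row count as the dataset in the tests
random.shuffle(data)       # simulate an arbitrary initial arrangement
ordered = sorted(data)     # ORDER BY makes the page contents deterministic

page = paginate(ordered, 100, 10)
```

Without the sort, the contents of `page` would depend on the shuffle; with it, offset 100 with limit 10 always yields the same rows regardless of the initial arrangement.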
Confidence score: 4/5

- This PR is safe to merge with minimal risk, as it adds well-tested functionality without modifying existing behavior.
- The score reflects comprehensive test coverage and an implementation that follows established patterns in the codebase.
- Pay close attention to the Rust implementation in `spark_analyzer.rs` to ensure proper error handling for edge cases.
3 files reviewed, no comments
Force-pushed from 1e5e079 to 42ebedc (compare)
Hi @universalmind303, I see you created the EPIC #3581, and this PR is to add offset support.
Signed-off-by: plotor <[email protected]>
Force-pushed from 42ebedc to 4fffe6d (compare)
Hi @srilman, since I don't have merge permission, could you help me merge this PR?
Changes Made

Related Issues

- #3581
- #4547

Checklist

- `docs/mkdocs.yml` navigation