Skip to content

feat: Add offset support to Spark Connect #4962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

plotor
Copy link
Contributor

@plotor plotor commented Aug 12, 2025

Changes Made

Related Issues

#3581
#4547

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Aug 12, 2025
Copy link

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 69.23077% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.86%. Comparing base (072858e) to head (4fffe6d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-connect/src/spark_analyzer.rs 69.23% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4962      +/-   ##
==========================================
+ Coverage   76.29%   76.86%   +0.56%     
==========================================
  Files         918      918              
  Lines      128703   126910    -1793     
==========================================
- Hits        98195    97545     -650     
+ Misses      30508    29365    -1143     
Files with missing lines Coverage Δ
src/daft-connect/src/spark_analyzer.rs 80.17% <69.23%> (-0.22%) ⬇️

... and 56 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@plotor plotor force-pushed the zhenchao-offset-for-spark-connect-20250812 branch 2 times, most recently from de86fc2 to 1e5e079 Compare August 14, 2025 02:18
@plotor plotor marked this pull request as ready for review August 14, 2025 02:25
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR adds comprehensive OFFSET support to Daft's Spark Connect implementation, enabling pagination functionality that was previously missing. The implementation consists of three key components:

  1. Core Rust Implementation: Added offset functionality to spark_analyzer.rs by implementing an offset method that validates input parameters (ensuring non-negative values), converts to u64, and integrates with the logical plan builder. The implementation follows the same pattern as the existing limit method for consistency.

  2. SQL Query Support: Added test coverage in test_spark_sql.py that validates both OFFSET ... LIMIT and LIMIT ... OFFSET syntax variations work correctly with SQL queries. The tests use a large dataset (1024 rows) with shuffled data to ensure robust validation under realistic conditions.

  3. DataFrame API Support: Added programmatic offset testing in test_basics.py through the test_range_limit_offset function, which validates that df.offset() works correctly both independently and in combination with limit() operations.

The implementation addresses GitHub issues #3581 and #4547, providing essential pagination capabilities for Spark Connect users. The changes integrate seamlessly with Daft's existing logical plan architecture and maintain consistency with Spark's DataFrame API. All tests use shuffled data followed by ordering to ensure the offset mechanism works correctly regardless of initial data arrangement.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it adds well-tested functionality without modifying existing behavior
  • Score reflects comprehensive test coverage and implementation that follows established patterns in the codebase
  • Pay close attention to the Rust implementation in spark_analyzer.rs to ensure proper error handling for edge cases

3 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@plotor plotor force-pushed the zhenchao-offset-for-spark-connect-20250812 branch from 1e5e079 to 42ebedc Compare August 14, 2025 03:20
@plotor
Copy link
Contributor Author

plotor commented Aug 14, 2025

Hi @universalmind303, I see you created the EPIC #3581, and this PR is to add offset support to Spark Connect. Please take your time to review it when you're free.

@plotor plotor force-pushed the zhenchao-offset-for-spark-connect-20250812 branch from 42ebedc to 4fffe6d Compare August 15, 2025 10:42
@plotor
Copy link
Contributor Author

plotor commented Aug 15, 2025

Hi @srilman, Since I don't have merge permission, could you help me merge this PR?

@srilman srilman merged commit ff31546 into Eventual-Inc:main Aug 15, 2025
51 of 52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants