Skip to content

feat: unify all Daft type to Python type conversions #4972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 15, 2025

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Aug 13, 2025

Changes Made

We used to have five separate code paths for converting Daft types to Python:

  1. Literal -> Python
  2. Series.to_pylist
  3. Series.__iter__
  4. the SQL planner's lit_to_py_any
  5. Series casting logic

This PR makes all of them go through the literal to Python code path, making the conversion logic consistent across our library.

Also added extension type to literal.

TODO:

  • add conversion mapping to docs

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR implements a significant architectural refactoring that unifies all Daft-to-Python type conversions through a single code path. Previously, the codebase had five separate conversion mechanisms: literal-to-Python, Series.to_pylist, Series.__iter__, SQL planner's lit_to_py_any, and Series casting logic. Each path potentially handled conversions differently, leading to inconsistencies.

The changes consolidate all conversions to use the literal-to-Python pathway implemented in Rust. Key modifications include:

  1. New Extension literal support: Added Extension(Series) variant to the Literal enum to handle extension types as single-element series

  2. Unified conversion infrastructure: Created a new common-ndarray crate with NdArray trait for dynamic dispatch over ndarray types, enabling consistent numpy array conversions

  3. Simplified Python bindings: Removed the complex SeriesIterable class and _cast_to_python() method in favor of direct calls to Rust implementations

  4. Consolidated casting logic: Replaced hundreds of lines of specialized casting code in cast.rs with a single cast_to_python method that uses self.to_literals().map(|lit| lit.into_py_any(py))

  5. Enhanced literal conversions: Updated python.rs to handle previously unimplemented types (Tensor, SparseTensor, Embedding, Image, Extension) and improved timestamp/date handling

The refactoring introduces a breaking change where list and fixed-size list types now convert to daft.Series objects instead of Python lists. This maintains Daft's native data structures and avoids lossy conversions while keeping data in a more efficient columnar format. Tests were updated throughout to handle this behavioral change by calling .to_pylist() on Series objects when Python list comparison is needed.

Confidence score: 3/5

  • This PR introduces breaking changes with complex type conversion logic that could affect user code in unexpected ways
  • Score reflects the architectural complexity and potential for subtle bugs in the unified conversion system
  • Pay close attention to test files and the core literal conversion implementation in src/daft-core/src/lit/python.rs

21 files reviewed, 2 comments

Edit Code Review Bot Settings | Greptile

@kevinzwang kevinzwang marked this pull request as draft August 13, 2025 22:09
Copy link

codspeed-hq bot commented Aug 13, 2025

CodSpeed Performance Report

Merging #4972 will improve performances by ×3.3

Comparing kevin/unify-daft-to-py (9d5e777) with main (4f14d6a)

Summary

⚡ 1 improvements
✅ 23 untouched benchmarks

Benchmarks breakdown

Benchmark BASE HEAD Change
test_iter_rows_first_row[1 Small File] 125 ms 38.1 ms ×3.3

@kevinzwang kevinzwang changed the title feat!: unify all Daft type to Python type conversions feat: unify all Daft type to Python type conversions Aug 14, 2025
@kevinzwang kevinzwang marked this pull request as ready for review August 14, 2025 19:55
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

Note: This review covers only the changes made since the last review (commit 783f8c0), not the entire PR.

The recent changes focus on three key areas: (1) Method naming consistency - renames into_bound_py_any to into_py in the NdArray trait to align with the unified conversion naming pattern, (2) Test updates for new conversion behavior - updates multiple test files to handle the fact that map arrays now return lists of key-value tuples instead of dictionary-like objects, and that Series iteration returns native Python types rather than wrapped Series objects, and (3) SparseTensor indices standardization - changes SparseTensor indices from Vec<u64> to Series objects to maintain consistency with other tensor-like structures in the unified conversion system.

These changes support the PR's core objective of consolidating five separate Daft-to-Python conversion paths into a single unified approach through the literal conversion system. The modifications ensure consistent behavior across Series.to_pylist(), Series.__iter__(), literal conversions, SQL planner conversions, and Series casting logic. The test updates reflect behavioral changes where map types now return lists of tuples rather than dictionaries, and iteration over complex types returns native Python objects directly.

Confidence score: 3/5

  • This PR introduces behavioral changes that could break existing code, particularly the map array representation change from dictionary-like to list of tuples
  • Score reflects concerns about potential runtime panics in SparseTensor and Extension type conversion logic that assume single-element Series without proper bounds checking
  • Pay close attention to test files and SparseTensor literal handling code for potential runtime safety issues

8 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@kevinzwang kevinzwang requested a review from ccmao1130 as a code owner August 14, 2025 21:48
@kevinzwang kevinzwang merged commit ef6ae49 into main Aug 15, 2025
55 checks passed
@kevinzwang kevinzwang deleted the kevin/unify-daft-to-py branch August 15, 2025 04:28
Copy link

codecov bot commented Aug 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (c555b0c) to head (9d5e777).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #4972       +/-   ##
==========================================
- Coverage   79.14%       0   -79.15%     
==========================================
  Files         917       0      -917     
  Lines      126969       0   -126969     
==========================================
- Hits       100486       0   -100486     
+ Misses      26483       0    -26483     

see 917 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants