[SPARK-52492][SQL] Make InMemoryRelation.convertToColumnarIfPossible customizable #51189

Closed

Conversation

@zhztheplayer zhztheplayer commented Jun 16, 2025

https://issues.apache.org/jira/browse/SPARK-52492

What changes were proposed in this pull request?

This PR moves InMemoryRelation.convertToColumnarIfPossible into CachedBatchSerializer as a public API.
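
In rough terms, the change turns the conversion into an overridable hook on CachedBatchSerializer. A minimal sketch of the intended shape is below; the method name, doc comment, and placement are paraphrased from the PR's intent rather than copied from the patch, and all pre-existing members are elided:

```scala
import org.apache.spark.sql.execution.SparkPlan

// Elided view of org.apache.spark.sql.columnar.CachedBatchSerializer, showing
// only the new hook (the exact name and signature here are illustrative).
trait CachedBatchSerializer extends Serializable {

  /**
   * Given the physical plan that is about to be cached, return a columnar plan
   * this serializer can consume, or the input plan unchanged if no viable
   * conversion can be done. The built-in serializer would keep vanilla Spark's
   * ColumnarToRowExec-unwrapping behavior; plugins supply their own logic.
   */
  def convertToColumnarPlanIfPossible(plan: SparkPlan): SparkPlan

  // ... pre-existing members (supportsColumnarInput, convertColumnarBatchToCachedBatch,
  // buildFilter, vectorTypes, ...) omitted ...
}
```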

Why are the changes needed?

TL;DR: so that plugins like Gluten can customize the conversion logic for their own cache serializers.

Currently, InMemoryRelation.convertToColumnarIfPossible is tightly coupled with vanilla Spark's columnar processing logic. It unwraps the input columnar plan by removing the topmost ColumnarToRowExec, then assumes that the resulting columnar RDD can be recognized by the user-customized cache serializer.
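
For reference, the unwrapping currently behaves roughly like the sketch below (condensed from InMemoryRelation and wrapped in a standalone object here; treat it as illustrative rather than a verbatim copy): it peels off the topmost ColumnarToRowExec, possibly through a WholeStageCodegenExec/InputAdapter wrapper, and otherwise returns the plan unchanged.

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, InputAdapter, SparkPlan, WholeStageCodegenExec}

object ConvertToColumnarSketch {
  // Strip the topmost ColumnarToRowExec and hand whatever columnar plan sits
  // underneath to the cache serializer, regardless of which batch type it emits.
  def convertToColumnarIfPossible(plan: SparkPlan): SparkPlan = plan match {
    case gen: WholeStageCodegenExec =>
      gen.child match {
        case c2r: ColumnarToRowExec =>
          c2r.child match {
            case ia: InputAdapter => ia.child
            case _ => plan
          }
        case _ => plan
      }
    case c2r: ColumnarToRowExec => c2r.child // whole-stage codegen disabled
    case _ => plan
  }
}
```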

But this assumption does not always hold. In the Apache Gluten project, for example, we still need to distinguish among plans that all have supportsColumnar=true but produce different columnar batch types. So even after the topmost ColumnarToRowExec is removed, we don't know whether the unwrapped columnar RDD can be accepted by Gluten's cache serializer (which may handle only one particular type of columnar batch).

So in Gluten we added a rule to work around the logic in InMemoryRelation.convertToColumnarIfPossible: https://github.com/apache/incubator-gluten/blob/c6461b4e0c7d3022a31fa832aeab588b1a3200e6/gluten-substrait/src/main/scala/org/apache/gluten/extension/columnar/MiscColumnarRules.scala#L192-L217. This was the best approach we could come up with, but it is still not elegant; in particular, the rule is caller-sensitive, since it needs to determine whether it is being invoked as part of the cache planning process.
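
With the conversion exposed on CachedBatchSerializer, a plugin can make this decision itself. A minimal sketch of what a Gluten-like serializer could do, assuming the hook shape sketched above and a hypothetical batch-type check (neither is taken from the actual Gluten code):

```scala
import org.apache.spark.sql.columnar.CachedBatchSerializer
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Illustrative plugin-side override; kept abstract so the unrelated
// serialization members of CachedBatchSerializer can be left out here.
abstract class GlutenLikeCachedBatchSerializer extends CachedBatchSerializer {

  // Hypothetical helper: does this plan emit the one columnar batch type
  // that this serializer knows how to cache?
  protected def producesSupportedBatch(plan: SparkPlan): Boolean

  // Only unwrap ColumnarToRowExec when the child emits a supported batch
  // type; otherwise return the plan unchanged so caching falls back to the
  // row-based path instead of feeding the serializer an unknown batch type.
  override def convertToColumnarPlanIfPossible(plan: SparkPlan): SparkPlan = plan match {
    case c2r: ColumnarToRowExec if producesSupportedBatch(c2r.child) => c2r.child
    case _ => plan
  }
}
```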

Does this PR introduce any user-facing change?

No.

How was this patch tested?

With the added unit tests.

@github-actions github-actions bot added the SQL label Jun 16, 2025
* @return The output plan. Could either be a columnar plan if the input plan is convertible, or
* the input plan unchanged if no viable conversion can be done.
*/
@Since("4.0.1")
Member

According to the Apache Spark backporting policy, this should be 4.1.0 because only bug-fixes are allowed for branch-4.0.

Member Author

Thanks. Addressed, and also updated the Target Version/s field in the JIRA ticket.

@zhztheplayer zhztheplayer force-pushed the wip-cache-customize-unwrap branch from 32a0fc7 to 63971eb on June 17, 2025 04:14
@zhztheplayer zhztheplayer force-pushed the wip-cache-customize-unwrap branch from 63971eb to a353773 on June 17, 2025 08:36
@yaooqinn yaooqinn left a comment

LGTM

zhztheplayer commented Jun 18, 2025

@yaooqinn Thanks for amending the annotation!

@yaooqinn yaooqinn closed this in 44d9fce Jun 19, 2025
@yaooqinn

Merged to master, thank you @zhztheplayer @dongjoon-hyun
