[SPARK-52492][SQL] Make InMemoryRelation.convertToColumnarIfPossible customizable #51189
Conversation
 * @return The output plan. Could either be a columnar plan if the input plan is convertible, or
 *         the input plan unchanged if no viable conversion can be done.
 */
@Since("4.0.1")
According to the Apache Spark backporting policy, this should be 4.1.0 because only bug-fixes are allowed for branch-4.0.
Thanks. Addressed, and also updated the Target Version/s field in the JIRA ticket.
LGTM
@yaooqinn Thanks for amending the annotation!
Merged to master, thank you @zhztheplayer @dongjoon-hyun
https://issues.apache.org/jira/browse/SPARK-52492
What changes were proposed in this pull request?
This PR moves InMemoryRelation.convertToColumnarIfPossible to become a public API of CachedBatchSerializer.
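As a rough illustration (a sketch only, not the exact code added by this PR), the relocated hook is assumed to keep the convertToColumnarIfPossible name and a SparkPlan => SparkPlan signature, with a default that mirrors the current InMemoryRelation behavior of unwrapping a topmost ColumnarToRowExec:

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Sketch of the assumed default: strip a topmost row-conversion node so the
// cached data can stay columnar, otherwise return the plan unchanged. The
// existing InMemoryRelation logic also handles plans wrapped in
// WholeStageCodegenExec, omitted here for brevity.
def convertToColumnarIfPossible(plan: SparkPlan): SparkPlan = plan match {
  case ColumnarToRowExec(child) => child
  case other => other
}
```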
Why are the changes needed?
TL;DR: so that plugins like Gluten can customize this logic for their own cache serializers.

Currently, InMemoryRelation.convertToColumnarIfPossible is tightly coupled with vanilla Spark's columnar processing. It unwraps the input columnar plan by removing the topmost ColumnarToRowExec, then assumes that the RDD produced by the remaining plan can be recognized by the user-configured cache serializer.

Sometimes this assumption is invalid. In the Apache Gluten project, we may need to distinguish between plans that all have supportsColumnar=true but emit different columnar batch types. So even after the topmost ColumnarToRowExec is removed, we still don't know whether the unwrapped columnar RDD can be accepted by Gluten's cache serializer (which may only handle one particular type of columnar batch).

To work around this, Gluten carries a rule that replicates the logic of InMemoryRelation.convertToColumnarIfPossible: https://github.com/apache/incubator-gluten/blob/c6461b4e0c7d3022a31fa832aeab588b1a3200e6/gluten-substrait/src/main/scala/org/apache/gluten/extension/columnar/MiscColumnarRules.scala#L192-L217. This is the best workaround we have found, but it is still not elegant, especially because the rule is caller-sensitive: it has to determine whether it is being invoked during the cache planning process or not.
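For illustration, below is a minimal sketch of how a plugin's serializer could override the relocated hook so that ColumnarToRowExec is only unwrapped when the columnar child produces a batch type the serializer actually understands. producesMyBatches is a hypothetical plugin-side check, not part of Spark's API, and the override signature is assumed to match the sketch above:

```scala
import org.apache.spark.sql.columnar.CachedBatchSerializer
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Left abstract so the remaining CachedBatchSerializer members (batch
// conversion, filtering, etc.) can be omitted from this sketch.
abstract class MyColumnarCacheSerializer extends CachedBatchSerializer {

  // Hypothetical plugin-specific test: does this plan emit the columnar batch
  // type that this serializer knows how to cache?
  protected def producesMyBatches(plan: SparkPlan): Boolean

  // Only unwrap the row conversion when the columnar child is consumable by
  // this serializer; otherwise keep the plan unchanged so caching falls back
  // to the row-based path.
  override def convertToColumnarIfPossible(plan: SparkPlan): SparkPlan = plan match {
    case ColumnarToRowExec(child) if producesMyBatches(child) => child
    case other => other
  }
}
```

With the hook living on the serializer, a plugin like Gluten would no longer need a caller-sensitive planner rule to second-guess whether the unwrapped columnar RDD will be accepted by its cache serializer.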
Does this PR introduce any user-facing change?
No.
How was this patch tested?
With the added UTs.