-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-52424] Add unit test to verify in Spark Connect, Column object defined outside of an UDF can be referenced properly. #51101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Let's file a JIRA |
Done -- SPARK-52424. |
# Define 'col' outside the UDF below, so with Spark Connect it'd have to be serialized. | ||
col = sf.lit(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be defined outside of func
and used in func
, not f
.
…tside of an UDF can be serialized properly.
@@ -136,6 +138,29 @@ def func(df, _): | |||
if q: | |||
q.stop() | |||
|
|||
def test_streaming_foreach_batch_external_column(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to pyspark/sql/tests/streaming/test_streaming_foreach_batch.py
? The test will be shared between classic and connect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also per discussion, removed the UDF part and made 'col' defined outside of the func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this now as the super method will be called automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
What changes were proposed in this pull request?
Added a unit test to verify the case of a UDF referencing a Column object defined outside of it can be referenced from the UDF as intended.
Why are the changes needed?
This isn't really an intended usage pattern, but there are users implicitly relying on this behavior. So it'd cause issues for those users if this is unintentionally broken. Here we add a new unit test to ensure this case is covered.
Does this PR introduce any user-facing change?
No
How was this patch tested?
This is a unit test only change.
Was this patch authored or co-authored using generative AI tooling?
No