Skip to content

[SPARK-52424] Add unit test to verify in Spark Connect, Column object defined outside of an UDF can be referenced properly. #51101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

haoyangeng-db
Copy link

What changes were proposed in this pull request?

Added a unit test to verify the case of a UDF referencing a Column object defined outside of it can be referenced from the UDF as intended.

Why are the changes needed?

This isn't really an intended usage pattern, but there are users implicitly relying on this behavior. So it'd cause issues for those users if this is unintentionally broken. Here we add a new unit test to ensure this case is covered.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is a unit test only change.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon
Copy link
Member

Let's file a JIRA

@haoyangeng-db haoyangeng-db changed the title Add unit test to verify in Spark Connect, Column object defined outside of an UDF can be referenced properly. [SPARK-52424] Add unit test to verify in Spark Connect, Column object defined outside of an UDF can be referenced properly. Jun 9, 2025
@haoyangeng-db
Copy link
Author

Let's file a JIRA

Done -- SPARK-52424.

Comment on lines 145 to 146
# Define 'col' outside the UDF below, so with Spark Connect it'd have to be serialized.
col = sf.lit(10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be defined outside of func and used in func, not f.

@@ -136,6 +138,29 @@ def func(df, _):
if q:
q.stop()

def test_streaming_foreach_batch_external_column(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to pyspark/sql/tests/streaming/test_streaming_foreach_batch.py? The test will be shared between classic and connect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also per discussion, removed the UDF part and made 'col' defined outside of the func.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this now as the super method will be called automatically.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants