[SPARK-52402][PS] Fix divide-by-zero errors in Kendall and Pearson correlation under ANSI mode
### What changes were proposed in this pull request?
Fix divide-by-zero errors in Kendall and Pearson correlation (e.g., `groupby().corr('kendall')`) when ANSI mode is enabled.
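The ANSI error message itself points at the tolerant alternative: replacing plain division with a null-safe division so a zero denominator yields NULL instead of raising. A minimal sketch of that pattern (illustrative only; `num` and `den` are placeholder columns, and the actual patch may guard the division differently):
```py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).select(F.lit(1.0).alias("num"), F.lit(0.0).alias("den"))

# Under ANSI mode, num / den raises DIVIDE_BY_ZERO;
# try_divide returns NULL for a zero divisor instead.
df.select(F.try_divide("num", "den").alias("ratio")).show()
```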
### Why are the changes needed?
Ensure pandas on Spark works correctly with ANSI mode enabled.
Part of https://issues.apache.org/jira/browse/SPARK-52169.
### Does this PR introduce _any_ user-facing change?
Yes
```py
>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> ps.set_option("compute.ansi_mode_support", True)
>>> df = ps.DataFrame(
... {"A": [0, 0, 0, 1, 1, 2], "B": [-1, 2, 3, 5, 6, 0], "C": [4, 6, 5, 1, 3, 0]},
... columns=["A", "B", "C"]
... )
```
FROM
```py
>>> df.groupby("A").corr('kendall')
25/06/04 14:40:03 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 51)
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"__truediv__" was called from
...
```
TO
```py
>>> df.groupby("A").corr('kendall')
B C
A
0 B 1.000000 0.333333
C 0.333333 1.000000
1 B 1.000000 1.000000
C 1.000000 1.000000
2 B 1.000000 NaN
C NaN 1.000000
```
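For the single-row group `A == 2`, the off-diagonal Kendall correlation is undefined, so it now comes back as NaN rather than raising, which matches native pandas. A quick cross-check against pandas itself (illustrative; the printed frame should match the output above, modulo formatting):
```py
# Cross-check the fixed pandas-on-Spark result against native pandas.
import pandas as pd

pdf = pd.DataFrame(
    {"A": [0, 0, 0, 1, 1, 2], "B": [-1, 2, 3, 5, 6, 0], "C": [4, 6, 5, 1, 3, 0]}
)
print(pdf.groupby("A").corr("kendall"))
```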
### How was this patch tested?
Unit tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #51090 from xinrong-meng/ansi_corr.
Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>