
Commit 13a7da9

add flag
1 parent 7d0b6bd commit 13a7da9

File tree

2 files changed: +25 −2 lines changed

python/docs/source/migration_guide/pyspark_upgrade.rst

Lines changed: 1 addition & 1 deletion

@@ -22,8 +22,8 @@ Upgrading PySpark
 Upgrading from PySpark 4.0 to 4.1
 ---------------------------------
 
+* In Spark 4.1, ``DataFrame['name']`` and ``DataFrame.name`` on the Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set the ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
 * In Spark 4.1, Arrow-optimized Python UDF supports UDT input / output instead of falling back to the regular UDF. To restore the legacy behavior, set ``spark.sql.execution.pythonUDF.arrow.legacy.fallbackOnUDT`` to ``true``.
-
 * In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDTF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDTF.pandas.conversion.enabled``.
 
 
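A minimal sketch of the behavior change described in the new migration note, assuming a Spark Connect session (the connect URL and the DataFrame contents below are illustrative):

    import os

    # Must be set before the lookup happens; only the exact string "1"
    # restores the legacy eager validation.
    os.environ["PYSPARK_VALIDATE_COLUMN_NAME_LEGACY"] = "1"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "value"])

    df.id           # fine in both modes: "id" is a real column
    df["no_such"]   # with the flag set, fails eagerly on lookup; without it,
                    # the error only surfaces once the plan is analyzed or
                    # executed, e.g. on df.select("no_such").collect()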

python/pyspark/sql/connect/dataframe.py

Lines changed: 24 additions & 1 deletion

@@ -44,6 +44,7 @@
 )
 
 import copy
+import os
 import sys
 import random
 import pyarrow as pa
@@ -1703,7 +1704,10 @@ def __getattr__(self, name: str) -> "Column":
                 errorClass="JVM_ATTRIBUTE_NOT_SUPPORTED", messageParameters={"attr_name": name}
             )
 
-        if name not in self.columns:
+        if (
+            os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1"
+            and name not in self.columns
+        ):
             raise PySparkAttributeError(
                 errorClass="ATTRIBUTE_NOT_SUPPORTED", messageParameters={"attr_name": name}
             )
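The guard compares the environment variable against the exact string ``"1"``, so values such as ``"true"`` do not re-enable the legacy check. A standalone sketch of that semantics (the helper name is hypothetical):

    import os

    def _legacy_validation_enabled() -> bool:
        # Mirrors the guard in __getattr__ above: exact match against "1";
        # any other value leaves the new lazy behavior in place.
        return os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1"

    os.environ["PYSPARK_VALIDATE_COLUMN_NAME_LEGACY"] = "true"
    assert not _legacy_validation_enabled()

    os.environ["PYSPARK_VALIDATE_COLUMN_NAME_LEGACY"] = "1"
    assert _legacy_validation_enabled()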
@@ -1732,6 +1736,25 @@ def __getitem__(
                     )
                 )
             else:
+                # TODO: revisit classic Spark's Dataset.col
+                # if (sparkSession.sessionState.conf.supportQuotedRegexColumnName) {
+                #   colRegex(colName)
+                # } else {
+                #   ConnectColumn(addDataFrameIdToCol(resolve(colName)))
+                # }
+
+                # Validate the column name.
+                if (
+                    os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1"
+                    and not hasattr(self._session, "is_mock_session")
+                ):
+                    from pyspark.sql.connect.types import verify_col_name
+
+                    # Try our best to verify the column name against the cached schema;
+                    # if that fails, fall back to server-side validation.
+                    if not verify_col_name(item, self._schema):
+                        self.select(item).isLocal()
+
                 return self._col(item)
         elif isinstance(item, Column):
             return self.filter(item)
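The ``__getitem__`` guard validates in two stages: a best-effort check against the client-side cached schema, then, only if that check cannot confirm the name, a lightweight server round trip that forces plan analysis without executing anything. A standalone restatement of that flow, assuming a Connect DataFrame ``df`` (the wrapper function is hypothetical; ``verify_col_name``, ``_schema``, ``select``, and ``isLocal`` are the same calls the diff uses):

    from pyspark.sql.connect.types import verify_col_name

    def validate_column_name(df, item: str) -> None:
        # Stage 1: best-effort check against the locally cached schema.
        if not verify_col_name(item, df._schema):
            # Stage 2: build a one-column plan and trigger server-side
            # analysis; an unresolvable name makes the server raise here.
            # isLocal() is used because it analyzes without executing.
            df.select(item).isLocal()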
