[SPARK-52494] Support colon-sign operator syntax to access Variant fields #51190

Status: Open · wants to merge 1 commit into master

Conversation

haoyangeng-db

What changes were proposed in this pull request?

Adds support for accessing fields inside a Variant data type through the colon-sign operator. The syntax is documented here: https://docs.databricks.com/aws/en/sql/language-manual/functions/colonsign

Why are the changes needed?

Provides a convenient way to access fields inside a Variant via SQL.
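As a rough model of the operator's semantics (this is not Spark's implementation, which operates on the binary Variant encoding — just an illustration using Python's `json` module, with an invented helper name `extract_field`), `expr:price` behaves like a key lookup on the parsed JSON:

```python
import json

def extract_field(variant_json, field):
    # Hypothetical helper: models `expr:field` on a VARIANT holding JSON.
    # Spark's real implementation works on the Variant binary format,
    # not on a JSON string.
    return json.loads(variant_json)[field]

# Mirrors: SELECT PARSE_JSON('{ "price": 5 }'):price
print(extract_field('{ "price": 5 }', "price"))  # 5
```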

Does this PR introduce any user-facing change?

Yes -- syntax that previously failed with a ParseException is now supported.

=== In Scala Spark shell:

Before:

scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

== SQL ==
SELECT PARSE_JSON('{ "price": 5 }'):price
-----------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:274)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:97)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:93)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$5(SparkSession.scala:492)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:491)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
  ... 42 elided

After:

scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
val res0: Array[org.apache.spark.sql.Row] = Array([5])

=== In PySpark REPL:

Before:

spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haoyan.geng/oss-scala/python/pyspark/sql/session.py", line 1810, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Users/haoyan.geng/oss-scala/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
        answer, self.gateway_client, self.target_id, self.name)
  File "/Users/haoyan.geng/oss-scala/python/pyspark/errors/exceptions/captured.py", line 294, in deco
    raise converted from None
pyspark.errors.exceptions.captured.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

== SQL ==
select parse_json('{ "price": 5 }'):price::int
-----------------------------------^^^

After:

spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
[Row(price=5)]

How was this patch tested?

  • Added new test cases in SQLQueryTestSuite (sql/core/src/test/resources/sql-tests/inputs/variant-field-extractions.sql).
  • Manually tested the new behavior in Spark Shell (Scala) and PySpark REPL.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jun 16, 2025

/**
* Represents the extraction of data from a field that contains semi-structured data. The
* semi-structured format can be anything (JSON, key-value delimited, etc), and that information
Contributor:
it can be VARIANT only now

@@ -0,0 +1,13 @@
-- Simple field extraction and type casting.
select parse_json('{ "price": 5 }'):price;
Contributor:
nit: we can create a temp view with one or more VARIANT columns, to simplify the other SELECT queries in this test.

-- Applying an invalid cast.
select parse_json('{ "price": 12345.678 }'):price::decimal(3, 2);
-- Access field in an array and feed it into functions.
select parse_json('{ "item": [ { "model" : "basic", "price" : 6.12 }, { "model" : "medium", "price" : 9.24 } ] }'):item[0].price::double;
Contributor:
let's test all the valid syntaxes, e.g. ASTERISK, brackets with string, etc.
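To make the path syntax under test concrete, the `item[0].price::double` query above walks an array index, then a field, then applies a cast. A rough Python analogue of that evaluation (plain `json`, not Spark's Variant machinery) is:

```python
import json

doc = ('{ "item": [ { "model" : "basic", "price" : 6.12 },'
       ' { "model" : "medium", "price" : 9.24 } ] }')

# Mirrors: ...:item[0].price::double  -- index, then field, then cast
price = float(json.loads(doc)["item"][0]["price"])
print(price)  # 6.12
```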
