[SPARK-52494] Support colon-sign operator syntax to access Variant fields #51190

Status: Open · wants to merge 1 commit into master

Conversation

haoyangeng-db

What changes were proposed in this pull request?

Adds support for accessing fields inside a Variant data type through the colon-sign operator. The syntax is documented here: https://docs.databricks.com/aws/en/sql/language-manual/functions/colonsign

Why are the changes needed?

Provides a convenient way to access fields inside a Variant via SQL.
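As a rough model of the operator's semantics (this is not Spark's implementation, which operates on the binary Variant encoding — just an illustration using Python's `json` module, with an invented helper name `extract_field`), `expr:price` behaves like a key lookup on the parsed JSON:

```python
import json

def extract_field(variant_json, field):
    # Hypothetical helper: models `expr:field` on a VARIANT holding JSON.
    # Spark's real implementation works on the Variant binary format,
    # not on a JSON string.
    return json.loads(variant_json)[field]

# Mirrors: SELECT PARSE_JSON('{ "price": 5 }'):price
print(extract_field('{ "price": 5 }', "price"))  # 5
```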

Does this PR introduce any user-facing change?

Yes -- syntax that previously failed with a ParseException is now supported.

=== In Scala Spark shell:

Before:

scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

== SQL ==
SELECT PARSE_JSON('{ "price": 5 }'):price
-----------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:274)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:97)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:93)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$5(SparkSession.scala:492)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:491)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
  ... 42 elided

After:

scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
val res0: Array[org.apache.spark.sql.Row] = Array([5])

=== In PySpark REPL:

Before:

spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haoyan.geng/oss-scala/python/pyspark/sql/session.py", line 1810, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Users/haoyan.geng/oss-scala/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
        answer, self.gateway_client, self.target_id, self.name)
  File "/Users/haoyan.geng/oss-scala/python/pyspark/errors/exceptions/captured.py", line 294, in deco
    raise converted from None
pyspark.errors.exceptions.captured.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

== SQL ==
select parse_json('{ "price": 5 }'):price::int
-----------------------------------^^^

After:

spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
[Row(price=5)]

How was this patch tested?

  • Added new test cases in SQLQueryTestSuite (sql/core/src/test/resources/sql-tests/inputs/variant-field-extractions.sql).
  • Manually tested the new behavior in Spark Shell (Scala) and PySpark REPL.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jun 16, 2025

/**
* Represents the extraction of data from a field that contains semi-structured data. The
* semi-structured format can be anything (JSON, key-value delimited, etc), and that information
Contributor:
it can be VARIANT only now

@@ -0,0 +1,13 @@
-- Simple field extraction and type casting.
select parse_json('{ "price": 5 }'):price;
Contributor:
nit: we can create a temp view with one or more VARIANT columns, to simplify the other SELECT queries in this test.

-- Applying an invalid cast.
select parse_json('{ "price": 12345.678 }'):price::decimal(3, 2);
-- Access field in an array and feed it into functions.
select parse_json('{ "item": [ { "model" : "basic", "price" : 6.12 }, { "model" : "medium", "price" : 9.24 } ] }'):item[0].price::double;
Contributor:
let's test all the valid syntaxes, e.g. ASTERISK, brackets with string, etc.
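To make the path syntax under test concrete, the `item[0].price::double` query above walks an array index, then a field, then applies a cast. A rough Python analogue of that evaluation (plain `json`, not Spark's Variant machinery) is:

```python
import json

doc = ('{ "item": [ { "model" : "basic", "price" : 6.12 },'
       ' { "model" : "medium", "price" : 9.24 } ] }')

# Mirrors: ...:item[0].price::double  -- index, then field, then cast
price = float(json.loads(doc)["item"][0]["price"])
print(price)  # 6.12
```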
