Skip to content

[SPARK-52770][CONNECT] Support TIME type in connect proto #51464

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Jul 13, 2025

What changes were proposed in this pull request?

This PR adds TIME type support to Spark Connect.

Why are the changes needed?

TIME type is new in 4.1 and Spark Connect should support it.

Does this PR introduce any user-facing change?

Yes, it adds basic TIME type support.

How was this patch tested?

TODO.

Was this patch authored or co-authored using generative AI tooling?

No.

@MaxGekk
Copy link
Member

MaxGekk commented Jul 13, 2025

@dengziming Could you review this PR since you are working on similar one: #51462

Copy link
Member

@dengziming dengziming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @peter-toth for this PR, I think we should add a test case for this change, and I don't think your test case makes sense, I duplicated it locally and it run smoothly even without this change, take a look at the image below.

image

@@ -127,6 +129,11 @@ message DataType {
uint32 type_variation_reference = 1;
}

message Time {
int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make precision optional analogous to Decimal.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @peter-toth .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the PR description looks wrong to me.

# How was this patch tested?

Manually:

scala> sql("SELECT TIME '12:13:14'").show()
+---------------+
|TIME '12:13:14'|
+---------------+
|       12:13:14|
+---------------+

As reported on the JIRA issue, show is handled by differently and works already. collect was the problem.

$ bin/spark-connect-shell --remote sc://localhost:15002
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-preview1
      /_/

Type in expressions to have them evaluated.
Spark connect server version 4.1.0-preview1.
Spark session available as 'spark'.

scala> sql("SELECT TIME '12:13:14'").show()
+---------------+
|TIME '12:13:14'|
+---------------+
|       12:13:14|
+---------------+

scala> sql("SELECT TIME '12:13:14'").collect()
org.apache.spark.SparkException: org.apache.spark.sql.connect.common.InvalidPlanInput: [INTERNAL_ERROR] Does not support convert time(6) to connect proto types. SQLSTATE: XX000

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case like the above comment.

@dengziming
Copy link
Member

The difference is that we need to return the result schema which is represented by DataType, when calling show the schema is converted to StringType so it will work well, whereas when calling collect we need TimeType, this can't be tested by the tests I provided before.

-- sql("SELECT TIME '12:13:14'").show()
root {
  show_string {
    input {
      sql {
        query: "SELECT TIME \'12:13:14\'"
      }
    }
    num_rows: 20
    truncate: 20
    vertical: true
  }
}

-- sql("SELECT TIME '12:13:14'").collect()
root {
  sql {
    query: "SELECT TIME \'12:13:14\'"
  }
}

@peter-toth
Copy link
Contributor Author

peter-toth commented Jul 14, 2025

@dongjoon-hyun, ahh indeed, I ran the wrong test. The proper type support needs a bit more changes.

The difference is that we need to return the result schema which is represented by DataType, when calling show the schema is converted to StringType so it will work well, whereas when calling collect we need TimeType, this can't be tested by the tests I provided before.

@dengziming, that's expected.

@peter-toth peter-toth force-pushed the SPARK-52770-suppot-time-in-connect-proto branch from 68a1563 to 66361ef Compare July 14, 2025 12:11
@peter-toth peter-toth force-pushed the SPARK-52770-suppot-time-in-connect-proto branch from 66361ef to b16793b Compare July 14, 2025 13:18
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating the PR, @peter-toth .

@peter-toth
Copy link
Contributor Author

peter-toth commented Jul 14, 2025

Thank you for updating the PR, @peter-toth .

Sure. I will add some more tests and adjust the description as well, but I don't see yet what would be the best place for the sql("SELECT TIME '12:13:14'").collect() test.

@@ -69,6 +69,8 @@ message DataType {

// UnparsedDataType
Unparsed unparsed = 24;

Time time = 28;
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding. Do you know where is 25, 26, 27, @peter-toth ?

Copy link
Contributor Author

@peter-toth peter-toth Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately these are not in order. A few lines up:

  Variant variant = 25;

and down:

  // Reserved for geometry and geography types
  reserved 26, 27;

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM from my side. Thank you, @peter-toth .

cc @MaxGekk , @grundprinzip , @HyukjinKwon

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -182,6 +182,8 @@ void initAccessor(ValueVector vector) {
accessor = new TimestampAccessor(timeStampMicroTZVector);
} else if (vector instanceof TimeStampMicroVector timeStampMicroVector) {
accessor = new TimestampNTZAccessor(timeStampMicroVector);
} else if (vector instanceof TimeNanoVector timeNanoVector) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a stupid question, when is these method used? why should we change arrow related method when adding connect dataType?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants