-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Clarify date primitive type in PROTOCOL.md #740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
PROTOCOL.md
Outdated
@@ -520,7 +520,7 @@ float| 4-byte single-precision floating-point numbers | |||
double| 8-byte double-precision floating-point numbers | |||
boolean| `true` or `false` | |||
binary| A sequence of binary data. | |||
date| A calendar date, represented as a year-month-day triple without a timezone. | |||
date| A calendar date, represented as a year-month-day triple without a timezone. Stored as a 4 bytes incrementing count of days where day 0 is 1970-01-01. Negative numbers represent earlier days. Range: 0001-01-01 to 9999-12-31. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the memory representation? An implementation can decide any format on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zsxwing for the clarification. For 4 bytes incrementing count
, I was trying to focus on the physical bytes representation in the table files. But I can see why this is not something we want to enforce in the spec since users are free to store table files in any format of their choice.
On that note, how about the semantics of day 0 is 1970-01-01. Negative numbers represent earlier days.
and Range: 0001-01-01 to 9999-12-31
. Are these things that should be enforced in the spec or leave out to individual implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just refer the parquet spec, since these types are just stored in the parquet format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering parquet is only one of the formats we are supporting in delta (https://github.com/delta-io/delta/blob/master/PROTOCOL.md#Format-Specification), I think it would be better to not tie to a specific file format spec.
I updated the PR to remove the description for physical bytes representation. As for the semantics, I think there is value in formally specifying it with the following two options:
- use the existing semantic, i.e.
count 0 is 1970-01-01
- explicitly mention that deltatable doesn't not enforce the semantic and it's up to the application to come up with its own interpretation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer, Internally, stored as a count of days where day 0 is 1970-01-01. Negative numbers represent earlier days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer the above comment: Internally, stored as a count of days where day 0 is 1970-01-01. Negative numbers represent earlier days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. The reason I said it's implementation detail because we have at least two types of date representations. For example, for a date partition column, we store it in the string format YYYY-MM-DD
such as 2021-01-01
in AddFile.partitionValues
.
Range: 0001-01-01 to 9999-12-31.
I haven't done the check. Does parquet support 5 digits year, such as 10000-12-31?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet stores date as int32, which is more than enough for 5 digits years :) I also think this should be implementation details as well, therefore I have already removed the range requirement in my previous round of commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have at least two types of date representations.
Any comment for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I missed that. It looks like to account for partition value string serialization, we do need to restrict the range :( The 5 digits year is less of a concern, but I think the bigger problem is parquet date can represent date even earlier than 0001-01-01
. If you all agree this is the case, then I can update the PR to add Range: 0001-01-01 to 9999-12-31.
back to the spec.
@zsxwing @scottsand-db clarified semantics based on the review, please take another look |
No description provided.