Skip to content

fix: remove distinct_counts manifest field #163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 28, 2025

Conversation

gruuya
Copy link
Contributor

@gruuya gruuya commented Apr 26, 2025

Closes #162.

In this PR distinct_counts is removed from both DataFileV1/DataFileV2 schema.

I've tested this change manually and pyiceberg reads now work.

@JanKaul
Copy link
Owner

JanKaul commented Apr 28, 2025

Hi, thanks for the PR. I was wondering if we could also resolve the issue by setting the values to None/Precision::Absent but leave the field as part of the schema. I somehow would like to stay close to the spec even when we're not using the field. It should then finally be resolved by adding the V3 spec implementation.

It actually worries me that pyiceberg doesn't seem to distinguish between different table versions.

@gruuya
Copy link
Contributor Author

gruuya commented Apr 28, 2025

I was wondering if we could also resolve the issue by setting the values to None/Precision::Absent but leave the field as part of the schema.

I just tried setting this

.with_distinct_counts(Some(distinct_counts))

to None, but it didn't work (pyiceberg read fails with same error).

I somehow would like to stay close to the spec even when we're not using the field. It should then finally be resolved by adding the V3 spec implementation.

Yeah that makes sense to me.

It actually worries me that pyiceberg doesn't seem to distinguish between different table versions.

Agreed as well; the thing is this problem can probably be circumvented by constructing the Avro reader manually, but this isn't an option when using the catalog, e.g.

table = catalog.load_table((namespace, table_name))
print(table.scan(limit=1).to_arrow())

which presents a big problem to us.

@gruuya
Copy link
Contributor Author

gruuya commented Apr 28, 2025

I somehow would like to stay close to the spec even when we're not using the field. It should then finally be resolved by adding the V3 spec implementation.

Yeah that makes sense to me.

That said, the java repo doesn't seem to have this field defined either, as if it didn't exist in v1/v2 at all.

So the problem seems to be that the official spec is misaligned with two main implementations (java and python ones), since they effectively stopped using the field long before it got officially deprecated.

@JanKaul
Copy link
Owner

JanKaul commented Apr 28, 2025

Well, in that case there isn't really a choice here. We should be compatible with the other implementations. Thanks for trying though.

@JanKaul JanKaul merged commit 8cb7e01 into JanKaul:main Apr 28, 2025
2 checks passed
@gruuya
Copy link
Contributor Author

gruuya commented Apr 28, 2025

Well, in that case there isn't really a choice here. We should be compatible with the other implementations. Thanks for trying though.

Thanks!

The root of the problem as stated above is there is a mismatch between the spec and the reference (java) as well as the main supporting (python, apache rust) implementations.

From what I can tell, the community decided to treat the field as if it never was part of the spec (even though it still formally is in v1/v2), and thus there is a conundrum now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

distinct_counts manifest field is deprecated and conflicts with pyiceberg reads
2 participants