Description
Because of the String -> Fixed Binary conversion for UUID columns, both the readers and the writers are incorrect.
The vectorized reader initializes a FixedBinary reader on a column we report as a String, causing an UnsupportedOperationException:
java.lang.UnsupportedOperationException: Unsupported type: UTF8String
at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:82)
at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:140)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
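On the read side, the missing piece is the conversion from the 16-byte fixed binary that Parquet stores back to the UUID string Spark expects. A minimal sketch of that conversion, assuming it would be wired into the vectorized accessor (the helper name here is ours, not the actual accessor API):

import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.spark.unsafe.types.UTF8String;

// Decode a 16-byte big-endian UUID encoding into its canonical string form.
static UTF8String uuidBytesToUTF8String(byte[] bytes) {
  ByteBuffer buf = ByteBuffer.wrap(bytes);             // big-endian by default
  UUID uuid = new UUID(buf.getLong(), buf.getLong());  // most, then least significant bits
  return UTF8String.fromString(uuid.toString());
}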
The writer is broken because it gets String columns from Spark but needs to write fixed-length binary. Something like this is needed as a fix:
import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.io.api.Binary;
import org.apache.spark.unsafe.types.UTF8String;

private static PrimitiveWriter<UTF8String> uuids(ColumnDescriptor desc) {
  return new UUIDWriter(desc);
}

private static class UUIDWriter extends PrimitiveWriter<UTF8String> {
  // Reused per-writer buffer for the 16-byte UUID encoding
  private final ByteBuffer buffer = ByteBuffer.allocate(16);

  private UUIDWriter(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public void write(int repetitionLevel, UTF8String string) {
    UUID uuid = UUID.fromString(string.toString());
    buffer.rewind();
    buffer.putLong(uuid.getMostSignificantBits());   // bytes 0-7
    buffer.putLong(uuid.getLeastSignificantBits());  // bytes 8-15
    buffer.rewind();                                 // reset position before handing off
    column.writeBinary(repetitionLevel, Binary.fromReusedByteBuffer(buffer));
  }
}
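A note on the sketch above: ByteBuffer defaults to big-endian, which matches the byte order Parquet's UUID logical type uses for its 16-byte fixed-length binary, and Binary.fromReusedByteBuffer avoids allocating a fresh buffer per row, which is why the buffer is rewound both before filling and before handing it to the column writer.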
Activity
RussellSpitzer commented on Apr 18, 2022
I'm working on a PR for this
RussellSpitzer commented on Apr 18, 2022
Actually, after looking at this for a while, I think we should probably just always handle UUID as a binary type in Spark rather than trying to do a String conversion. If the end user needs the string representation, they can always cast.
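For illustration, a hedged sketch of what such a cast path could look like for a user if UUIDs surfaced as binary; a plain CAST to STRING would not produce the hex form, so this registers a small UDF instead (the function name and wiring are ours, not an existing API, and spark is an assumed SparkSession):

import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical helper: turn the 16 raw bytes back into "xxxxxxxx-xxxx-...".
spark.udf().register("uuid_to_string",
    (UDF1<byte[], String>) bytes -> {
      ByteBuffer buf = ByteBuffer.wrap(bytes);
      return new UUID(buf.getLong(), buf.getLong()).toString();
    },
    DataTypes.StringType);
// Usage: SELECT uuid_to_string(id) FROM db.tbl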
pvary commented on Apr 26, 2022
@RussellSpitzer: I am not sure whether this is still an actual issue or it was fixed in the current code, but I found a year ago that Parquet and ORC/Avro expect UUIDs differently for writes. See: #1881
And this is even before the Spark code 😄
RussellSpitzer commented on Apr 26, 2022
I have kind of punted on this; I was just toying around with it for a types test and don't have any users yet 🤞 who have hit this.
wobrycki commented on Jul 15, 2022
Hi @RussellSpitzer,
when we read a Spark df using Iceberg and this df has a UUID column type, exporting the df to CSV gives us an error:
Can this be related to this issue?
We could reproduce it with a table that has only one column of UUID type.
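For context, a minimal sketch (the catalog variable and names are ours) of how such a single-UUID-column table is typically declared; Spark SQL has no uuid type, so the column has to come from the Iceberg schema:

import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

// One required UUID column; reading this table back through Spark
// then fails as described in this issue.
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.UUIDType.get()));
catalog.createTable(TableIdentifier.of("db", "uuid_only"), schema);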
RussellSpitzer commented on Jul 15, 2022
Yep, currently the Spark code cannot read or write UUID correctly.
github-actions commented on Jan 23, 2023
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
wobrycki commented on Mar 20, 2023
@RussellSpitzer could you point us to the code where the adjustments are needed?
Should it be SparkValueReaders.java and SparkValueWriters.java?
I guess there is no test that writes the UUID column, as it is not working yet, so we should probably adjust tests like TestSparkParquetWriter as well.
Normally uuids(ColumnDescriptor desc) returns an instance of ValueWriter and not PrimitiveWriter; what is the reason we are going for PrimitiveWriter here?
RussellSpitzer commented on Mar 20, 2023
Sorry, we totally deprioritized this, so I haven't looked at it in a while and couldn't tell you why.
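One possible answer to the PrimitiveWriter question above: PrimitiveWriter is the base class on the Parquet write path, while SparkValueWriters builds Avro ValueWriters. The same conversion in ValueWriter shape might look like this sketch, assuming the write(datum, encoder) contract of org.apache.iceberg.avro.ValueWriter:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.avro.io.Encoder;
import org.apache.iceberg.avro.ValueWriter;
import org.apache.spark.unsafe.types.UTF8String;

// Sketch only: encode the UUID string as the 16-byte fixed value Avro expects.
class UUIDValueWriter implements ValueWriter<UTF8String> {
  @Override
  public void write(UTF8String string, Encoder encoder) throws IOException {
    UUID uuid = UUID.fromString(string.toString());
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    encoder.writeFixed(buffer.array());
  }
}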