
Spark: Cannot read or write UUID columns #4581

Description

RussellSpitzer (Member)

Because of the String -> fixed binary conversion, the readers and the writers are both incorrect.

The vectorized reader initializes a fixed-binary reader on a column we report as a String, causing an unsupported-reader exception:

java.lang.UnsupportedOperationException: Unsupported type: UTF8String
	at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:82)
	at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:140)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Sour

The writer is broken because it gets String columns from Spark but needs to write fixed binary.

Something like this is needed as a fix:

  private static PrimitiveWriter<UTF8String> uuids(ColumnDescriptor desc) {
    return new UUIDWriter(desc);
  }

  private static class UUIDWriter extends PrimitiveWriter<UTF8String> {
    // Reused scratch buffer; Parquet copies the bytes before the next write.
    private ByteBuffer buffer = ByteBuffer.allocate(16);

    private UUIDWriter(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public void write(int repetitionLevel, UTF8String string) {
      // Parse the canonical string form and emit the 16 raw bytes,
      // most-significant half first (big-endian), as Parquet expects.
      UUID uuid = UUID.fromString(string.toString());
      buffer.rewind();
      buffer.putLong(uuid.getMostSignificantBits());
      buffer.putLong(uuid.getLeastSignificantBits());
      buffer.rewind();
      column.writeBinary(repetitionLevel, Binary.fromReusedByteBuffer(buffer));
    }
  }
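
The read path needs the inverse conversion. A minimal sketch of that direction, using a hypothetical helper rather than Iceberg's actual reader classes: decode the 16 fixed bytes back into the UTF8String that Spark expects for the reported String type.

  import java.nio.ByteBuffer;
  import java.util.UUID;
  import org.apache.spark.unsafe.types.UTF8String;

  public class UuidReadSide {
    // Hypothetical helper: decode 16 big-endian bytes (most-significant half
    // first, matching the writer above) into a Spark UTF8String.
    static UTF8String uuidToUtf8(byte[] bytes) {
      ByteBuffer buffer = ByteBuffer.wrap(bytes);
      UUID uuid = new UUID(buffer.getLong(), buffer.getLong());
      return UTF8String.fromString(uuid.toString());
    }
  }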

Activity

RussellSpitzer (Member, Author) commented on Apr 18, 2022

I'm working on a PR for this

RussellSpitzer (Member, Author) commented on Apr 18, 2022

Actually, after looking at this for a while, I think we should probably just always handle UUID as a binary type in Spark rather than trying to do a String conversion. If the end user needs the string representation, they can always cast?
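
To illustrate that suggestion: if Spark surfaced UUID columns as raw 16-byte binary, a user wanting the string form could register a small conversion function. A sketch, where uuid_to_string is a made-up UDF name, not an existing Spark or Iceberg function:

  import java.nio.ByteBuffer;
  import java.util.UUID;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.api.java.UDF1;
  import org.apache.spark.sql.types.DataTypes;

  public class UuidUdf {
    // Turn the 16 raw UUID bytes back into the canonical hyphenated string,
    // e.g. for use as: spark.sql("SELECT uuid_to_string(id) FROM db.tbl")
    static void register(SparkSession spark) {
      spark.udf().register("uuid_to_string", (UDF1<byte[], String>) bytes -> {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong()).toString();
      }, DataTypes.StringType);
    }
  }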

pvary (Contributor) commented on Apr 26, 2022

@RussellSpitzer: I am not sure whether this is still an actual issue or whether it was fixed in the current code, but I found a year ago that Parquet and ORC/Avro expect UUIDs differently for writes. See: #1881

And this is even before the Spark code 😄

RussellSpitzer (Member, Author) commented on Apr 26, 2022

I have kind of punted on this; I was just toying around with it for a types test and don't have any users yet 🤞 who have hit this.

wobrycki commented on Jul 15, 2022

Hi @RussellSpitzer,

when we read a Spark DataFrame using Iceberg and the DataFrame has a UUID column, exporting it to CSV gives us this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o669.csv.
: org.apache.spark.SparkException: Job aborted.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
(...)
        ... 1 more
Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.unsafe.types.UTF8String ([B is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

Could this be related to this issue?
We could reproduce it with a table containing only a single column of UUID type.
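
A minimal sketch of that reproduction, assuming a HadoopCatalog with a local warehouse path (the paths and names here are illustrative, and the table would need data written by another engine, since Spark's UUID write path is broken as well):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Schema;
  import org.apache.iceberg.catalog.TableIdentifier;
  import org.apache.iceberg.hadoop.HadoopCatalog;
  import org.apache.iceberg.types.Types;
  import org.apache.spark.sql.SparkSession;

  public class UuidCsvRepro {
    public static void main(String[] args) {
      // Create an Iceberg table whose only column is a UUID.
      HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "/tmp/warehouse");
      Schema schema = new Schema(
          Types.NestedField.required(1, "id", Types.UUIDType.get()));
      catalog.createTable(TableIdentifier.of("db", "uuid_only"), schema);

      // Reading the table back through Spark and exporting to CSV fails with
      // the ClassCastException above ([B cannot be cast to UTF8String).
      SparkSession spark = SparkSession.builder().getOrCreate();
      spark.read().format("iceberg").load("/tmp/warehouse/db/uuid_only")
          .write().csv("/tmp/uuid_export");
    }
  }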

RussellSpitzer (Member, Author) commented on Jul 15, 2022

Yep, currently the Spark code cannot read or write UUIDs correctly.

github-actions commented on Jan 23, 2023

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.

wobrycki commented on Mar 20, 2023

@RussellSpitzer could you point us to the code where the adjustments are needed?
Should it be SparkValueReaders.java and SparkValueWriters.java?

I guess there is no test that writes a UUID column, since that is not working yet, so we should probably adjust tests like TestSparkParquetWriter as well.

Normally uuids(ColumnDescriptor desc) returns an instance of ValueWriter rather than PrimitiveWriter; what is the reason we are going for PrimitiveWriter here?

RussellSpitzer (Member, Author) commented on Mar 20, 2023

Sorry, we totally deprioritized this, so I haven't looked at it in a while and couldn't tell you why.

self-assigned this on Apr 25, 2023

6 remaining items
