Further refactor Parquet readers for v2 support #13290
Conversation
  this.vectorizedDefinitionLevelReader = null;
}

@Override
protected void initDataReader(Encoding dataEncoding, ByteBufferInputStream in, int valueCount) {
-  ValuesReader previousReader = plainValuesReader;
+  ValuesReader previousReader = (ValuesReader) valuesReader;
Explicit cast?
Yeah, currently we need the cast due to a bit of a Java-based pickle; happy to change this up if you have any ideas on how it can be improved. `VectorizedValuesReader` is currently not bound to `ValuesReader`:
- We can't just make `VectorizedValuesReader` extend `ValuesReader` because the former is an interface and the latter is a class.
- We can't just make `VectorizedValuesReader` an ABC because then classes like `VectorizedPlainValuesReader` can't extend both it and `PlainValuesReader`.
- Since `ValuesReader` is an ABC in Parquet but `VectorizedValuesReader` is an interface, we can't easily mark `valuesReader` as being both types.
- We could create a generic method and bind the type to `<T extends ValuesReader & VectorizedValuesReader>`, but that feels clunky.

I see a couple of ways out of this, but they mostly involve copying a lot of code, and I thought this cast wasn't too high a price to pay. It does mean that we need to more or less keep `VectorizedValuesReader` in sync with `ValuesReader` if it evolves.
(This cast is inherited from #9772)
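To make the dilemma concrete, here is a minimal, self-contained sketch using toy stand-in types (not the real Parquet/Iceberg classes): an abstract class plays the role of `ValuesReader`, an interface plays the role of `VectorizedValuesReader`, and the intersection-type bound mentioned above lets a generic method treat one object as both types without a cast.

```java
// Toy stand-in for Parquet's ValuesReader (an abstract class).
abstract class ValuesReaderLike {
  abstract int readInteger();
}

// Toy stand-in for the VectorizedValuesReader interface.
interface VectorizedLike {
  void readIntegers(int total, int[] vec, int rowId);
}

// A concrete reader must extend the class AND implement the interface,
// which is why the interface can't simply be turned into an ABC.
class PlainReader extends ValuesReaderLike implements VectorizedLike {
  private int next = 0;

  @Override
  int readInteger() {
    return next++;
  }

  @Override
  public void readIntegers(int total, int[] vec, int rowId) {
    for (int i = 0; i < total; i++) {
      vec[rowId + i] = readInteger();
    }
  }
}

public class IntersectionBoundDemo {
  // The "clunky" alternative to the explicit cast: an intersection-type
  // bound tells the compiler the argument is both types at once.
  static <T extends ValuesReaderLike & VectorizedLike> int[] readBatch(T reader, int n) {
    int[] out = new int[n];
    reader.readIntegers(n, out, 0);
    return out;
  }

  public static void main(String[] args) {
    int[] vals = readBatch(new PlainReader(), 3);
    System.out.println(java.util.Arrays.toString(vals)); // prints [0, 1, 2]
  }
}
```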
import org.apache.parquet.io.api.Binary;

/** Interface for value decoding that supports vectorized (aka batched) decoding. */
interface VectorizedValuesReader {
In the corresponding Spark class there are some other methods available; do we need those as well? Part of a follow-up, or un-needed?
A Spark expert can correct me, but I think the read methods with a `WritableColumnVector` parameter are for copying the data into a Spark `WritableColumnVector`. In Iceberg we use Arrow vectors instead of `WritableColumnVector`s. The skip methods are for Parquet page skipping. For the v2 read support, they are not needed. I'd like to resurrect my work on Parquet page skipping, so I may look into leveraging those skip methods myself, but for now, we don't need them.
Yeah, that's exactly correct. I played with the idea of implementing something like `WritableColumnVector` within Iceberg (but for building something like a `VectorizedColumnIterator`), but I think we won't need all of that code after all.
I have some small questions about the roadmap for where we go from here, but this makes sense to me as a first step. As long as we are more or less copying the Spark approach, I think we are probably safe here. @huaxingao Could you do a quick check as well?
Some weird rebase happened here; git history looks scary now :)
Force-pushed from a63ee5d to 9ecc2be.
/*
 * Reads `total` values into `vec`, starting at `vec[rowId]`
 */
nit: Shall we add Javadoc to all the methods for consistency? Spark usually uses block comments for similar APIs, but in those cases, the methods are grouped without blank lines. Since these are separated, it feels more like they’re meant to be documented individually. Not sure what the typical style is in Iceberg though.
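As a sketch of what the suggestion might look like, the block comment above could be turned into per-method Javadoc along these lines (the interface name and signature here are illustrative stand-ins, not the actual Iceberg code):

```java
/** Toy interface used only to illustrate the Javadoc style being suggested. */
interface VectorizedValuesReaderSketch {
  /**
   * Reads {@code total} values into {@code vec}, starting at {@code vec[rowId]}.
   *
   * @param total number of values to read
   * @param vec   destination array to fill
   * @param rowId starting offset within {@code vec}
   */
  void readIntegers(int total, int[] vec, int rowId);
}
```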
@eric-maynard Thanks for the PR! The approach looks good to me and seems like a reasonable first step.
In issues like #7162 and #11371, it's reported that newer Parquet encodings like DELTA_BINARY_PACKED don't work with the current Parquet readers. #11661 recently refactored the Parquet readers to improve code reuse, but there are a few more changes needed to prepare us for Parquet v2 support.
This refactor introduces a new interface, `VectorizedValuesReader`, and changes readers like `TimestampMillisReader` to work with this new type. After this change, new implementations of `VectorizedValuesReader` can be added to support encodings like DELTA_BINARY_PACKED.

This PR is a revival of @wgtmac's #9772, which, based on our conversation, he will not be able to continue work on. Thanks for the great work, @wgtmac.