Fix double file scan from nested schema pruning #1096
Conversation
Force-pushed eede681 to 7619932
@Kimahriman thanks for this PR, will need to take a thorough look. Regarding "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue": since you've stated this only happens when we apply a scan that filters on nested columns, can you write a unit test that previously had such a double scan, and then use …
Yeah, this was just kind of brainstorming an approach. At first I tried to fully prune every expression down to only the field it cared about, but then hit weird cases of a …
Didn't know that was a thing, so I'll look into that.
I guess another strategy could be to gather all attributes from the actual filters and replace them in the prepared filters by expression ID?
This seems like a more robust solution to me. What do you think @zsxwing?
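To make the proposal concrete, here is a minimal sketch (not Delta's actual code; the `Attr`/`Op` classes are hypothetical stand-ins for Catalyst expressions) of rewriting a filter tree by expression ID, so the prepared filters end up referencing the same attributes as the actual filters:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int  # analogous to Catalyst's ExprId

@dataclass(frozen=True)
class Op:  # any non-leaf expression node
    name: str
    children: tuple

def replace_by_expr_id(expr, by_id):
    """Swap every Attr leaf for the attribute registered under the same expr_id."""
    if isinstance(expr, Attr):
        return by_id.get(expr.expr_id, expr)
    return Op(expr.name, tuple(replace_by_expr_id(c, by_id) for c in expr.children))

# The prepared filter references the pre-optimization attribute...
prepared = Op("GreaterThan", (Attr("nested", 7), Op("Literal", ())))
# ...while the actual filter holds a pruned attribute with the same expr_id.
pruned = Attr("nested_pruned", 7)
rewritten = replace_by_expr_id(prepared, {7: pruned})
assert rewritten.children[0] is pruned
```

The key property is that the expression ID is stable across optimization even when the attribute's schema changes, which is what makes it a plausible join key between the two filter sets.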
Also, I don't think …
Ah shoot, not as easy as I thought: the ordinals for the struct fields will be off after that. Need to think through that alternative a little more.
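A minimal illustration of the problem (with a hypothetical schema): Spark's `GetStructField` stores an ordinal into the struct, and nested schema pruning changes that ordinal, so a plain attribute swap leaves the two extractions unequal.

```python
# Struct schema before and after nested schema pruning.
full_struct = ["a", "b", "c"]   # what the prepared filters were resolved against
pruned_struct = ["c"]           # only the referenced field survives pruning

field = "c"
ordinal_before = full_struct.index(field)    # 2 in the prepared filters
ordinal_after = pruned_struct.index(field)   # 0 in the actual filters
assert ordinal_before != ordinal_after       # the extractions no longer compare equal
```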
I think you would essentially have to "re-resolve" the prepared filters, so to speak. The options I can think of right now are basically:
…
Force-pushed 7619932 to 5fc528a
I am trying to catch up on the discussion and think about it a little. My biggest fear is that if we attempt to duplicate any resolution/un-resolution code path, the duplicate can easily become inconsistent with the actual nested schema resolution logic, and hence cause bugs that lead to an incorrect set of files being read. That is why, until now, we have erred toward reading more data and letting it be filtered by the filter plan rather than reading less data.
The "un-resolve both sides" approach seems less likely to have any weird resolution issues, but obviously it only takes one counterexample to prove it doesn't work. I haven't thought of one yet, though.
So we spent some time looking into the issue a little bit more. A few solutions we're considering include:
…
As a next step we want to identify the rule that does the nested schema pruning (which wasn't immediately obvious from a cursory look). To figure this out, I'm planning to set spark.sql.planChangeLog.level to warn and try to find the rule. I'm not sure when I'll have time to do this, so feel free to look into it yourself and investigate the plausibility of (2).
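For reference, a sketch of how that config can be passed (assuming Spark 3.1+, where `spark.sql.planChangeLog.level` was introduced): at `warn` level Spark logs each optimizer rule that rewrites the plan, which makes it straightforward to spot the rule doing the nested schema pruning.

```shell
# Log every plan-changing rule at WARN so it shows up with default log settings.
spark-shell \
  --conf spark.sql.planChangeLog.level=warn
```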
I believe it's the …
Tried the first method; let me know what you think.
Hey @Kimahriman just wanted to provide an update. I didn't forget about this, we are just very busy with the upcoming release. Will be taking another look as soon as I have the chance.
No problem! I'd rather have a quicker release for Spark 3.3 😊 |
Figured out a test using …
Hi @Kimahriman - awesome! We will finish reviewing + merging this PR after the next Delta release, which we are working hard on now. Will get back to you within 1 or 2 weeks, cheers!
The test looks good, thanks for adding that. As for the current solution, I'm not completely convinced of its robustness/safety, and I'd rather err on the side of rescanning unnecessarily. It would be much better if we could find a way to reuse the code paths in the rule that does the pruning. I did take a look, though, and I see how that's not super accessible.
Has any scenario been thought of yet where the post-optimization filters actually change? The only thing that should happen between creating the initial scan and the final scan is that optimization rules are applied. Optimization rules shouldn't change the result of the query, just its performance, so theoretically the only thing that should change is that more filters are applied post-optimization. If files were included post-optimization that weren't included pre-optimization, that would indicate a serious bug unrelated to this PR, imo.

So in the false positive case, where we think the filters are the same under this logic but they are not, the worst that can happen is we use the pre-optimization scan, which could include more files than necessary. And in the false negative case, we simply recalculate the files for the scan, which is the same performance hit of the extra log scan we have today. Is there anything I'm missing or not thinking of?
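The asymmetry argued above can be sketched as follows (the function and its names are hypothetical, not Delta's API): a comparison between the prepared and actual filters fails safe in both directions.

```python
def choose_scan(prepared_filters, actual_filters, cached_scan, rescan):
    """Reuse the pre-optimization scan only when the filter sets compare equal.

    False positive (deemed equal when they differ): we reuse the cached scan,
    which can only include *more* files than strictly necessary.
    False negative (deemed different when equal): we pay one extra log scan,
    which is the status quo before this fix.
    """
    if set(prepared_filters) == set(actual_filters):
        return cached_scan
    return rescan()

# Equal filter sets reuse the cached (possibly over-inclusive) scan.
files = choose_scan({"x > 1"}, {"x > 1"},
                    cached_scan=["f1", "f2"],
                    rescan=lambda: ["f1"])
assert files == ["f1", "f2"]
```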
Force-pushed 2656160 to b15e895
Force-pushed b15e895 to fa4d602
Force-pushed fa4d602 to e96918e
Description
When comparing DeltaScan filters, un-resolve and re-resolve nested field extractions from the source expression set to the target expression set. This works around the fact that the prepared filters are pre-optimization, while the actual filters are post-optimization and can involve nested schema pruning.
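A toy model of the approach (hypothetical tuple encoding, not the PR's Scala code): collapse a nested field extraction back to its dotted path ("un-resolve"), then compare paths instead of the resolved expressions, so a pruning-induced ordinal shift no longer breaks equality.

```python
def unresolve(expr):
    """Collapse ('field', parent, name[, ordinal]) chains into a dotted path."""
    if expr[0] == "attr":
        return expr[1]
    return unresolve(expr[1]) + "." + expr[2]

# Same extraction of s.c, but resolved against different struct schemas:
# ordinal 2 in the full struct vs. ordinal 0 after nested schema pruning.
prepared = ("field", ("attr", "s"), "c", 2)
actual = ("field", ("attr", "s"), "c", 0)
assert prepared != actual                                # resolved forms differ
assert unresolve(prepared) == unresolve(actual) == "s.c" # paths still match
```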
Resolves #1073
How was this patch tested?
New UT.
Does this PR introduce any user-facing changes?
Yes, removes a double file scan for a scan with filters on nested columns.