
Fix double file scan from nested schema pruning #1096


Open
wants to merge 6 commits into master

Conversation

Contributor

@Kimahriman commented Apr 22, 2022

Description

When comparing DeltaScan filters, this un-resolves and re-resolves nested field extractions from the source expression set to the target expression set. This works around the fact that the prepared filters are built pre-optimization, while the actual filters are post-optimization, which can involve nested schema pruning.
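For illustration only, a minimal sketch of the comparison idea (this is not the actual patch; the helper names are made up, and it assumes Spark's Catalyst expression API):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue
import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, Literal}

// Replace resolved struct-field extractions (whose ordinals and data types can shift
// after nested schema pruning) with unresolved, name-based extractions.
def unresolveNestedFields(expr: Expression): Expression = expr.transform {
  case g: GetStructField => UnresolvedExtractValue(g.child, Literal(g.extractFieldName))
}

// Treat the prepared and actual filters as matching if they agree once both sides
// have been normalized this way.
def filtersMatch(prepared: Seq[Expression], actual: Seq[Expression]): Boolean =
  prepared.map(unresolveNestedFields).toSet == actual.map(unresolveNestedFields).toSet
```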

Resolves #1073

How was this patch tested?

New UT.

Does this PR introduce any user-facing changes?

Yes, it removes the double file scan for scans with filters on nested columns.

@scottsand-db
Collaborator

@Kimahriman thanks for this PR, will need to take a thorough look.

Regarding, "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue.". Since you've stated this only happens when we apply a scan which filters on nested columns, can you write a unit test that previously had such a double scan, and then use getScanReport to assert it only scans once?

@Kimahriman
Contributor Author

Yeah, this was just kind of brainstorming an approach. At first I tried to fully prune each expression down to only the field it cared about, but then hit weird cases like an isnotnull(nested), where it's not clear what to do because it isn't extracting any fields. Theoretically I think this should be safe since all column names should be unique, but definitely let me know what you think or if you can come up with a better idea.

Regarding, "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue.". Since you've stated this only happens when we apply a scan which filters on nested columns, can you write a unit test that previously had such a double scan, and then use getScanReport to assert it only scans once?

Didn't know that was a thing so I'll look into that.

@Kimahriman
Contributor Author

I guess another strategy could be to gather all attributes from the actual filters and replace them in the prepared filters by exprId?

@scottsand-db
Collaborator

I guess another strategy could be to gather all attributes from the actual filters and replace them in the prepared filters by exprId?

This seems like a more robust solution to me. What do you think @zsxwing ?

@Kimahriman
Contributor Author

Also, I don't think getScanReport can catch this (after trying it and seeing it not work). There's still only one FileSourceScanExec; it just does double duty in the matchingFiles method.

@Kimahriman
Contributor Author

Ah shoot, not as easy as I thought: the ordinals for the struct fields will be off after that. Need to think through that alternative a little more.

@Kimahriman
Contributor Author

I think you would essentially have to "re-resolve" the prepared filters, so to speak, so the options I can think of right now are basically:

  • This approach: convert both the prepared and actual filters to unresolved
  • Convert the prepared filters to unresolved, then try to re-resolve them against the actual attributes

@tdas
Contributor

tdas commented May 10, 2022

I am trying to catch up on the discussion and think about it a little. My biggest fear is that if we attempt to duplicate any resolution/un-resolution code path, that duplicate code path can easily become inconsistent with the actual nested schema resolution logic, hence bugs leading to an incorrect set of files being read. That's why, until now, we have erred toward reading more data and letting it be filtered by the filter plan, rather than reading less data.

@Kimahriman
Contributor Author

The "unresolve both sides" seems less likely to have any weird resolution issues, but obviously just takes one example to prove it doesn't work. Haven't thought of one yet though

@allisonport-db
Collaborator

So we spent some time looking into the issue a little bit more. A few solutions we're considering include

  1. Comparing AttributeReferences using their exprId
  2. Applying the rule that does the nested schema pruning to the prepared filters before checking equality
  3. Inject PrepareDeltaScan somewhere else (our options might be very limited here)

As a next step we want to identify the rule that does the nested schema pruning (which wasn't immediately obvious from a cursory look). To figure this out, I'm planning to set spark.sql.planChangeLog.level to warn and try to find the rule. I'm not sure when I'll have time to do this, so feel free to look into it yourself and investigate the plausibility of (2).
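For reference, plan-change logging is a standard Spark SQL setting and can be enabled at runtime (shown here purely as a debugging aid, not as part of the fix):

```scala
// Log every analyzer/optimizer rule that changes the plan at WARN level, so the rule
// responsible for the nested schema pruning shows up in the driver logs.
spark.conf.set("spark.sql.planChangeLog.level", "warn")
```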

@Kimahriman
Contributor Author

I believe it's the NestedColumnAliasing extractor in the ColumnPruning rule. So elaborating on the first two, these seem like the possibilities:

  1. Expand on the current method a bit. Take all the AttributeReferences in the actual filters, replace them by exprId in the prepared filters, and then transform all GetStructFields to UnresolvedExtractValue on both sides (see the sketch after this list). This still removes data type comparisons of the nested field, but at least you have the AttributeReferences to compare against. Alternatively, you could try to re-resolve the prepared filters with the replaced AttributeReferences, but I'm not sure how stable that would be. Maybe it's as simple as transforming GetStructField to an ExtractValue post-attribute-replacement?
  2. Try to run the ColumnPruning optimization rule on both the prepared and actual filters, though I'm not sure how that would work since the rule operates on plans, not expressions.
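A rough sketch of what (1) could look like, with illustrative helper names (not Delta's actual code) and assuming Catalyst's expression API:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExprId, Expression, GetStructField, Literal}

// Step 1: collect the post-optimization attributes (possibly carrying pruned struct types).
def actualAttrsById(actual: Seq[Expression]): Map[ExprId, AttributeReference] =
  actual.flatMap(_.collect { case a: AttributeReference => a })
    .map(a => a.exprId -> a)
    .toMap

// Step 2: swap those attributes into a prepared filter by exprId, and fall back to
// name-based extraction since struct ordinals no longer line up after pruning.
def alignPrepared(prepared: Expression, attrs: Map[ExprId, AttributeReference]): Expression =
  prepared.transform {
    case a: AttributeReference if attrs.contains(a.exprId) => attrs(a.exprId)
    case g: GetStructField => UnresolvedExtractValue(g.child, Literal(g.extractFieldName))
  }
```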

@Kimahriman
Contributor Author

Tried the first method; let me know what you think.

@allisonport-db
Collaborator

Hey @Kimahriman just wanted to provide an update. I didn't forget about this, we are just very busy with the upcoming release. Will be taking another look as soon as I have the chance.

@Kimahriman
Contributor Author

No problem! I'd rather have a quicker release for Spark 3.3 😊

@Kimahriman
Contributor Author

Figured out a test using withLogicalPlansCaptured to ensure there's only one log scan.
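For anyone curious, the same idea can be sketched generically with Spark's QueryExecutionListener (this is not the PR's actual test, and withLogicalPlansCaptured is Delta's own test helper, whose exact signature isn't reproduced here):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Capture the logical plan of every query executed inside `thunk`; a test can then
// count how many of those plans scan the Delta log.
def capturePlans(spark: SparkSession)(thunk: => Unit): Seq[LogicalPlan] = {
  val plans = ArrayBuffer.empty[LogicalPlan]
  val listener = new QueryExecutionListener {
    override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
      plans += qe.analyzed
    override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
  }
  spark.listenerManager.register(listener)
  try thunk finally spark.listenerManager.unregister(listener)
  // Note: listener callbacks fire asynchronously, so a real test should flush the
  // listener bus before asserting on the captured plans.
  plans.toSeq
}
```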

@scottsand-db
Collaborator

Hi @Kimahriman - awesome! We will finish reviewing + merging this PR after the next Delta release, which we are working hard at now. Will get back to you within 1 or 2 weeks, cheers!

@allisonport-db
Collaborator

The test looks good, thanks for adding that. As for the current solution, I'm not completely convinced of its robustness/safety, and I'd rather err on the side of rescanning unnecessarily. It would be much better if we could find a way to reuse the code paths in the rule that does the pruning. I did take a look, though, and I see how that's not super accessible.

@Kimahriman
Contributor Author

As for the current solution, I'm not completely convinced of its robustness/safety, and I'd rather err on the side of rescanning unnecessarily.

Has any scenario been thought of yet where the post-optimization filters actually change? The only thing that should happen between creating the initial scan and the final scan is that optimization rules are applied. Optimization rules shouldn't change the result of the query, just its performance. In this case, theoretically the only thing that should change about the filters is that more filters are applied post-optimization. If files are included post-optimization that weren't included pre-optimization, that would indicate a serious bug unrelated to this PR, imo.

So in the false-positive case, where this logic thinks the filters are the same when they actually aren't, the worst that could happen is that we use the pre-optimization scan, which could include more files than necessary. And in the false-negative case, we simply recalculate the files for the scan regardless, so it's just the same performance hit of the extra log scan.

Is there anything I'm missing or not thinking of?

Development

Successfully merging this pull request may close these issues.

[BUG] Double file scan with stats skipping
4 participants