
Fix double file scan from nested schema pruning #1096


Open
wants to merge 6 commits into master

Conversation

Contributor

@Kimahriman commented Apr 22, 2022

Description

When comparing DeltaScan filters, this un-resolves and re-resolves nested field extractions from the source expression set to the target expression set. This works around the fact that the prepared filters are built pre-optimization, while the actual filters are post-optimization, which can involve nested schema pruning.
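For illustration only, a minimal sketch of the comparison idea (this is not the actual patch; the helper names are made up, and it assumes Spark's Catalyst expression API):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue
import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField, Literal}

// Replace resolved struct-field extractions (whose ordinals and data types can shift
// after nested schema pruning) with unresolved, name-based extractions.
def unresolveNestedFields(expr: Expression): Expression = expr.transform {
  case g: GetStructField => UnresolvedExtractValue(g.child, Literal(g.extractFieldName))
}

// Treat the prepared and actual filters as matching if they agree once both sides
// have been normalized this way.
def filtersMatch(prepared: Seq[Expression], actual: Seq[Expression]): Boolean =
  prepared.map(unresolveNestedFields).toSet == actual.map(unresolveNestedFields).toSet
```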

Resolves #1073

How was this patch tested?

New UT.

Does this PR introduce any user-facing changes?

Yes, it removes the double file scan for scans with filters on nested columns.

@scottsand-db
Collaborator

@Kimahriman thanks for this PR, will need to take a thorough look.

Regarding, "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue.". Since you've stated this only happens when we apply a scan which filters on nested columns, can you write a unit test that previously had such a double scan, and then use getScanReport to assert it only scans once?

@Kimahriman
Contributor Author

Yeah, this was just kind of brainstorming an approach. At first I tried to fully prune each expression down to only the field it cared about, but then hit weird cases like an isnotnull(nested), where it's not clear what to do because it isn't extracting any fields. Theoretically I think this should be safe since all column names should be unique, but definitely let me know what you think or if you can come up with a better idea.

Regarding, "Not sure how to test the actual behavior, so I just added a test for the helper method that demonstrates the issue.". Since you've stated this only happens when we apply a scan which filters on nested columns, can you write a unit test that previously had such a double scan, and then use getScanReport to assert it only scans once?

Didn't know that was a thing so I'll look into that.

@Kimahriman
Contributor Author

I guess another strategy could be to gather all attributes from the actual filters and replace them in the prepared filters by exprId?

@scottsand-db
Collaborator

I guess another strategy could be to gather all attributes from the actual filters and replace them in the prepared filters by exprId?

This seems like a more robust solution to me. What do you think @zsxwing ?

@Kimahriman
Contributor Author

Also, I don't think getScanReport can catch this (after trying it and seeing it not work). There's still only one FileSourceScanExec; it just does double duty in the matchingFiles method.

@Kimahriman
Contributor Author

Ah shoot, not as easy as I thought: the ordinals for the struct fields will be off after that. Need to think through that alternative a little more.

@Kimahriman
Contributor Author

I think you would essentially have to "re-resolve" the prepared filters, so to speak, so the options I can think of right now are basically:

  • This approach: convert both the prepared and actual filters to unresolved
  • Convert the prepared filters to unresolved, then try to re-resolve them against the actual attributes

@tdas
Contributor

tdas commented May 10, 2022

I am trying to catch up on the discussion and think about it a little. My biggest fear is that if we attempt to duplicate any resolution/un-resolution code path, that duplicate code path can easily become inconsistent with the actual nested schema resolution logic, hence bugs leading to an incorrect set of files being read. That's why, until now, we have erred toward reading more data and letting it be filtered by the filter plan, rather than reading less data.

@Kimahriman
Contributor Author

The "unresolve both sides" seems less likely to have any weird resolution issues, but obviously just takes one example to prove it doesn't work. Haven't thought of one yet though

@allisonport-db
Collaborator

So we spent some time looking into the issue a little bit more. A few solutions we're considering include

  1. Comparing AttributeReferences using their exprId
  2. Applying the rule that does the nested schema pruning to the prepared filters before checking equality
  3. Inject PrepareDeltaScan somewhere else (our options might be very limited here)

As a next step we want to identify the rule that does the nested schema pruning (which wasn't immediately obvious from a cursory look). To figure this out, I'm planning to set spark.sql.planChangeLog.level to warn and try to find the rule. I'm not sure when I'll have time to do this, so feel free to look into it yourself and investigate the plausibility of (2).
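For reference, plan-change logging is a standard Spark SQL setting and can be enabled at runtime (shown here purely as a debugging aid, not as part of the fix):

```scala
// Log every analyzer/optimizer rule that changes the plan at WARN level, so the rule
// responsible for the nested schema pruning shows up in the driver logs.
spark.conf.set("spark.sql.planChangeLog.level", "warn")
```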

@Kimahriman
Contributor Author

I believe it's the NestedColumnAliasing extractor in the ColumnPruning rule. So elaborating on the first two, these seem like the possibilities:

  1. Expand on the current method a bit. Take all the AttributeReferences in the actual filters, replace them by exprId in the prepared filters, and then transform all GetStructFields to UnresolvedExtractValue on both sides (see the sketch after this list). This still removes data type comparisons of the nested field, but at least you have the AttributeReferences to compare against. Alternatively, you could try to re-resolve the prepared filters with the replaced AttributeReferences, but I'm not sure how stable that would be. Maybe it's as simple as transforming GetStructField to an ExtractValue post-attribute-replacement?
  2. Try to run the ColumnPruning optimization rule on both the prepared and actual filters, though I'm not sure how that would work since the rule operates on plans, not expressions.
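A rough sketch of what (1) could look like, with illustrative helper names (not Delta's actual code) and assuming Catalyst's expression API:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExprId, Expression, GetStructField, Literal}

// Step 1: collect the post-optimization attributes (possibly carrying pruned struct types).
def actualAttrsById(actual: Seq[Expression]): Map[ExprId, AttributeReference] =
  actual.flatMap(_.collect { case a: AttributeReference => a })
    .map(a => a.exprId -> a)
    .toMap

// Step 2: swap those attributes into a prepared filter by exprId, and fall back to
// name-based extraction since struct ordinals no longer line up after pruning.
def alignPrepared(prepared: Expression, attrs: Map[ExprId, AttributeReference]): Expression =
  prepared.transform {
    case a: AttributeReference if attrs.contains(a.exprId) => attrs(a.exprId)
    case g: GetStructField => UnresolvedExtractValue(g.child, Literal(g.extractFieldName))
  }
```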

@Kimahriman
Contributor Author

Tried the first method; let me know what you think.

@allisonport-db
Collaborator

Hey @Kimahriman just wanted to provide an update. I didn't forget about this, we are just very busy with the upcoming release. Will be taking another look as soon as I have the chance.

@Kimahriman
Contributor Author

No problem! I'd rather have a quicker release for Spark 3.3 😊

@Kimahriman
Contributor Author

Figured out a test using withLogicalPlansCaptured to ensure there's only one log scan.
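For anyone curious, the same idea can be sketched generically with Spark's QueryExecutionListener (this is not the PR's actual test, and withLogicalPlansCaptured is Delta's own test helper, whose exact signature isn't reproduced here):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Capture the logical plan of every query executed inside `thunk`; a test can then
// count how many of those plans scan the Delta log.
def capturePlans(spark: SparkSession)(thunk: => Unit): Seq[LogicalPlan] = {
  val plans = ArrayBuffer.empty[LogicalPlan]
  val listener = new QueryExecutionListener {
    override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
      plans += qe.analyzed
    override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
  }
  spark.listenerManager.register(listener)
  try thunk finally spark.listenerManager.unregister(listener)
  // Note: listener callbacks fire asynchronously, so a real test should flush the
  // listener bus before asserting on the captured plans.
  plans.toSeq
}
```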

@scottsand-db
Collaborator

Hi @Kimahriman - awesome! We will finish reviewing + merging this PR after the next Delta release, which we are working hard at now. Will get back to you within 1 or 2 weeks, cheers!

@allisonport-db
Collaborator

The test looks good, thanks for adding that. As for the current solution, I'm not completely convinced of its robustness/safety, and I'd rather err on the side of rescanning unnecessarily. It would be much better if we could find a way to reuse the code paths in the rule that does the pruning. I did take a look, though, and I see how that's not super accessible.

@Kimahriman
Contributor Author

As for the current solution, I'm not completely convinced of its robustness/safety, and I'd rather err on the side of rescanning unnecessarily.

Has any scenario been thought of yet where the post-optimization filters actually change? The only thing that should happen between creating the initial scan and the final scan is that optimization rules are applied. Optimization rules shouldn't change the result of the query, just its performance. In this case, theoretically the only thing that should change about the filters is that more filters are applied post-optimization. If files are included post-optimization that weren't included pre-optimization, that would indicate a serious bug unrelated to this PR, imo.

So in the false-positive case, where this logic thinks the filters are the same when they actually aren't, the worst that could happen is that we use the pre-optimization scan, which could include more files than necessary. And in the false-negative case, we simply recalculate the files for the scan regardless, so it's just the same performance hit of the extra log scan.

Is there anything I'm missing or not thinking of?

Development

Successfully merging this pull request may close these issues.

[BUG] Double file scan with stats skipping
4 participants