Apply filter pushdown to the inner join to avoid scanning all rows in Parquet files #432


Open
wants to merge 4 commits into master

Conversation

@LantaoJin (Contributor) commented May 19, 2020

MergeIntoCommand uses an inner join to find touched files, but the targetDF contains some non-deterministic columns (e.g. FILE_NAME_COL), which prevents Parquet filter pushdown on the target table. As a result, the current implementation scans all rows in the Parquet files of targetDF (files that contain no matched rows are already skipped), and in the worst case it can still scan all data in the whole target table. So we need to add a filter before the withColumn calls to enable Parquet filter pushdown. Without this patch, PushedFilters in FileScanExec is empty; with this patch, all target-only predicates can be pushed down to the Parquet FileScanExec.
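A minimal sketch of the idea (the column names and the predicate are illustrative, not the actual Delta Lake source): apply the deterministic, target-only predicate to the target DataFrame before attaching the non-deterministic helper columns, so the Filter sits below the Project in the plan and can reach the Parquet scan.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, input_file_name, monotonically_increasing_id}

// Hypothetical names mirroring Delta's internal helper columns.
val ROW_ID_COL    = "_row_id_"
val FILE_NAME_COL = "_file_name_"

// Before the patch: the filter lands above the non-deterministic
// projections, so PushedFilters on the Parquet scan stays empty.
def touchedFilesBefore(target: DataFrame): DataFrame =
  target
    .withColumn(ROW_ID_COL, monotonically_increasing_id())
    .withColumn(FILE_NAME_COL, input_file_name())
    .filter(col("id") > 100) // stand-in for a target-only predicate

// After the patch: filter first, then add the helper columns; the
// deterministic predicate can now be pushed down to the Parquet reader.
def touchedFilesAfter(target: DataFrame): DataFrame =
  target
    .filter(col("id") > 100)
    .withColumn(ROW_ID_COL, monotonically_increasing_id())
    .withColumn(FILE_NAME_COL, input_file_name())
```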

@tdas (Contributor) commented May 19, 2020

Can you add a unit test to verify whether this is working?

@LantaoJin (Contributor, Author) commented:

> Can you add a unit test to verify whether this is working?

The difficult part of writing a unit test is that I cannot extract the inner plan information, so I tested it manually and checked the plans in the Spark UI.
Before patching:
[Screenshot: query plan before the patch]

We can see that the filter is added after the two projections (monotonically_increasing_id and input_file_name). Those columns are non-deterministic, so PushedFilters in the FileScan node is empty.

After patching:
[Screenshot: query plan after the patch]

We can see that the filter is now added before the two projections (monotonically_increasing_id and input_file_name), so the filter we added is pushed down to the FileScan node and shows up in PushedFilters.
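For readers without the screenshots, the same effect can be reproduced with explain(); a minimal sketch, with a made-up path and column name:

```scala
import org.apache.spark.sql.functions.col

// Write some Parquet data, then read it back with a deterministic filter
// and inspect the physical plan for pushed-down Parquet filters.
spark.range(1000).toDF("id").write.parquet("/tmp/pushdown_demo")

val df = spark.read.parquet("/tmp/pushdown_demo").filter(col("id") > 100)
df.explain()
// The FileScan line should list something like:
//   PushedFilters: [IsNotNull(id), GreaterThan(id,100)]
// Adding .withColumn("f", input_file_name()) *before* the filter makes
// PushedFilters come up empty, because Spark will not push a predicate
// through a non-deterministic projection: the bug this PR fixes.
```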

@tdas (Contributor) commented May 20, 2020

This is a very good find. Thank you for finding it. I see that this is hard to unit test. ... hmm let me think about how to unit test this.

@LantaoJin (Contributor, Author) commented:

@tdas Did you find a nice way to unit test this? I've also submitted a similar PR, #438; would you have time to take a look?

@LantaoJin (Contributor, Author) commented:

Gentle ping @tdas @zsxwing @jose-torres @brkyvz

@brkyvz (Collaborator) commented Jun 22, 2020

Hi @LantaoJin, thank you for submitting this PR! By adding SQLQueryListener, you should be able to capture the internal execution of the plan. Then we can compare the metrics on the number of rows read to unit test this.
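A sketch of that approach. The comment says SQLQueryListener; the closest public hook I'm aware of is Spark's QueryExecutionListener, so treat the exact listener class as an assumption. The test registers a listener, runs the MERGE, and reads the scan node's row-count metric from the captured plan:

```scala
import org.apache.spark.sql.execution.{FileSourceScanExec, QueryExecution}
import org.apache.spark.sql.util.QueryExecutionListener

// Capture the physical plan of each successful query so a test can
// inspect scan metrics afterwards.
class PlanCapturingListener extends QueryExecutionListener {
  @volatile var lastQe: Option[QueryExecution] = None
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    lastQe = Some(qe)
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

// In the test (sketch):
//   val listener = new PlanCapturingListener
//   spark.listenerManager.register(listener)
//   ... run the MERGE statement ...
//   val rowsRead = listener.lastQe.get.executedPlan.collect {
//     case scan: FileSourceScanExec => scan.metrics("numOutputRows").value
//   }.sum
//   assert(rowsRead < totalRowsInTarget) // pushdown skipped some rows
```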

@LantaoJin (Contributor, Author) commented:

@tdas @brkyvz unit test added.

@LantaoJin (Contributor, Author) commented Jul 16, 2020

Any more comments? Gentle ping @tdas @brkyvz @zsxwing @gatorsmile
