Apply filter pushdown to the inner join to avoid scanning all rows in Parquet files #432


Open
wants to merge 4 commits into master

Conversation

@LantaoJin (Contributor) commented May 19, 2020

MergeIntoCommand uses an inner join to find touched files, but the targetDF contains some non-deterministic columns (e.g. FILE_NAME_COL), which prevents Parquet filter pushdown on the target table. As a result, the current implementation scans all rows in the Parquet files of targetDF (files that contain no matched rows are already skipped), and in the worst case it can still scan all data in the whole target table. So we need to add a filter before the withColumn calls to enable Parquet filter pushdown. Without this patch, PushedFilters in FileScanExec is empty; with this patch, all target-only predicates can be pushed down to the Parquet FileScanExec.
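A minimal sketch of the idea (the column names and the predicate are illustrative, not the actual Delta Lake source): apply the deterministic, target-only predicate to the target DataFrame before attaching the non-deterministic helper columns, so the Filter sits below the Project in the plan and can reach the Parquet scan.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, input_file_name, monotonically_increasing_id}

// Hypothetical names mirroring Delta's internal helper columns.
val ROW_ID_COL    = "_row_id_"
val FILE_NAME_COL = "_file_name_"

// Before the patch: the filter lands above the non-deterministic
// projections, so PushedFilters on the Parquet scan stays empty.
def touchedFilesBefore(target: DataFrame): DataFrame =
  target
    .withColumn(ROW_ID_COL, monotonically_increasing_id())
    .withColumn(FILE_NAME_COL, input_file_name())
    .filter(col("id") > 100) // stand-in for a target-only predicate

// After the patch: filter first, then add the helper columns; the
// deterministic predicate can now be pushed down to the Parquet reader.
def touchedFilesAfter(target: DataFrame): DataFrame =
  target
    .filter(col("id") > 100)
    .withColumn(ROW_ID_COL, monotonically_increasing_id())
    .withColumn(FILE_NAME_COL, input_file_name())
```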

@tdas (Contributor) commented May 19, 2020

Can you add a unit test to verify whether this is working?

@LantaoJin (Contributor, Author) commented:

> Can you add a unit test to verify whether this is working?

The difficult part of writing a unit test is that I cannot extract the inner plan information, so I tested it manually and checked the plans in the Spark UI.
Before patching:
[Screenshot: query plan before the patch]

We can see that the filter is added after the two projections (monotonically_increasing_id and input_file_name). Those columns are non-deterministic, so PushedFilters in the FileScan node is empty.

After patching:
[Screenshot: query plan after the patch]

We can see that the filter is now added before the two projections (monotonically_increasing_id and input_file_name), so the filter we added is pushed down to the FileScan node and shows up in PushedFilters.
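For readers without the screenshots, the same effect can be reproduced with explain(); a minimal sketch, with a made-up path and column name:

```scala
import org.apache.spark.sql.functions.col

// Write some Parquet data, then read it back with a deterministic filter
// and inspect the physical plan for pushed-down Parquet filters.
spark.range(1000).toDF("id").write.parquet("/tmp/pushdown_demo")

val df = spark.read.parquet("/tmp/pushdown_demo").filter(col("id") > 100)
df.explain()
// The FileScan line should list something like:
//   PushedFilters: [IsNotNull(id), GreaterThan(id,100)]
// Adding .withColumn("f", input_file_name()) *before* the filter makes
// PushedFilters come up empty, because Spark will not push a predicate
// through a non-deterministic projection: the bug this PR fixes.
```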

@tdas (Contributor) commented May 20, 2020

This is a very good find. Thank you for finding it. I see that this is hard to unit test. ... hmm let me think about how to unit test this.

@LantaoJin (Contributor, Author) commented:

@tdas Did you find a nice way to unit test this? I've also submitted a similar PR, #438; would you have time to take a look?

@LantaoJin (Contributor, Author) commented:

Gentle ping @tdas @zsxwing @jose-torres @brkyvz

@brkyvz (Collaborator) commented Jun 22, 2020

Hi @LantaoJin, thank you for submitting this PR! By adding SQLQueryListener, you should be able to capture the internal execution of the plan. Then we can compare the metrics on the number of rows read to unit test this.
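A sketch of that approach. The comment says SQLQueryListener; the closest public hook I'm aware of is Spark's QueryExecutionListener, so treat the exact listener class as an assumption. The test registers a listener, runs the MERGE, and reads the scan node's row-count metric from the captured plan:

```scala
import org.apache.spark.sql.execution.{FileSourceScanExec, QueryExecution}
import org.apache.spark.sql.util.QueryExecutionListener

// Capture the physical plan of each successful query so a test can
// inspect scan metrics afterwards.
class PlanCapturingListener extends QueryExecutionListener {
  @volatile var lastQe: Option[QueryExecution] = None
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    lastQe = Some(qe)
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

// In the test (sketch):
//   val listener = new PlanCapturingListener
//   spark.listenerManager.register(listener)
//   ... run the MERGE statement ...
//   val rowsRead = listener.lastQe.get.executedPlan.collect {
//     case scan: FileSourceScanExec => scan.metrics("numOutputRows").value
//   }.sum
//   assert(rowsRead < totalRowsInTarget) // pushdown skipped some rows
```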

@LantaoJin (Contributor, Author) commented:

@tdas @brkyvz unit test added.

@LantaoJin (Contributor, Author) commented Jul 16, 2020

Any more comments? Gentle ping @tdas @brkyvz @zsxwing @gatorsmile
