
Apply filter pushdown to source rows for the right outer join of matched only case #438


Open · wants to merge 4 commits into master

Conversation

@LantaoJin (Contributor) commented May 28, 2020

The reason for this optimization is similar to #432.

In the matched-only case, we use a right outer join between source and target to write the changes.
Because of the non-deterministic UDF makeMetricUpdateUDF, predicate pushdown for source rows is not applied. This PR manually adds a filter before the Projection that contains the non-deterministic UDF, which triggers filter pushdown (see the sketch below).
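To make the mechanism concrete, here is a minimal, self-contained sketch of the pushdown behavior. It is not Delta code: the toy DataFrame is illustrative, and countRow merely stands in for makeMetricUpdateUDF (which also increments a SQLMetric as a side effect). Catalyst does not push a deterministic filter through a non-deterministic expression, so applying the predicate before the UDF lets it reach the scan:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    val source = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")

    // Stand-in for makeMetricUpdateUDF: non-deterministic, always returns true.
    val countRow = udf(() => true).asNondeterministic()

    // Filter placed after the non-deterministic UDF: the optimizer cannot push
    // `id > 1` through it, so the scan reads every source file.
    source.filter(countRow()).filter(col("id") > 1).explain(true)

    // This PR's idea: apply the deterministic predicate first, so it is
    // pushed all the way down into the source scan.
    source.filter(col("id") > 1).filter(countRow()).explain(true)

    spark.stop()
  }
}
```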

Besides the performance improvement from filter pushdown, without this change the Spark Driver can easily run into frequent full GCs if the source table contains a massive number of files:
(In our internal version we use target left outer join source instead of source right outer join target, so the right side in the graphs below is the source table.)
[screenshot-1]
[screenshot-2]

From the class histogram taken during a Spark Driver full GC, we can see 3.8 million SerializableFileStatus instances, which roughly matches the file count of the source table in the graphs above. This memory cannot be garbage collected while the join is running.

2020-05-24T13:11:40.996-0700: [Full GC (Allocation Failure) 49G->39G(50G), 49.8580399 secs]
[Eden: 0.0B(2528.0M)->0.0B(2560.0M) Survivors: 32.0M->0.0B Heap: 49.4G(50.0G)->39.9G(50.0G)], [Metaspace: 201466K->200916K(239616K)]
2020-05-24T13:12:30.854-0700: [Class Histogram (after full gc):
 num     #instances         #bytes  class name
----------------------------------------------
   1:     148517177    27413443128  [C
   2:     148514960     4752478720  java.lang.String
   3:      22946326     3304270944  java.net.URI
   4:      20371798     2118666992  org.apache.hadoop.fs.LocatedFileStatus
   5:      20383330      978399840  org.apache.hadoop.fs.permission.FsPermission
   6:      22945656      550695744  org.apache.hadoop.fs.Path
   7:      20371798      488923152  [Lorg.apache.hadoop.fs.BlockLocation;
   8:         51688      344773816  [B
   9:       3876863      279134136  org.apache.spark.sql.execution.datasources.PartitionedFile
  10:       3806959      274101048  org.apache.spark.sql.execution.datasources.InMemoryFileIndex$SerializableFileStatus
  11:       2561048      245860608  org.apache.hadoop.fs.FileStatus
  12:        647901      186946488  [Lscala.collection.mutable.HashEntry;

After applying this optimization, the frequent full GC problem in this scenario was gone, and the performance of this outer join was greatly improved.
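For illustration only, here is a rough sketch of the shape of such a change, not the PR's actual diff. It assumes Catalyst's PredicateHelper for splitConjunctivePredicates; the parameters sourcePlan, condition, and incrSourceRowCountExpr mirror names in Delta's MergeIntoCommand but are assumptions here, and the package declaration is only needed because Dataset.ofRows is private[sql]:

```scala
package org.apache.spark.sql.delta

import org.apache.spark.sql.{Column, DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Literal, PredicateHelper}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object SourcePredicateSketch extends PredicateHelper {
  // Extract the conjuncts of the merge condition that reference only source
  // columns and apply them before the non-deterministic metric UDF, so the
  // deterministic part can be pushed into the file scan.
  def preFilteredSource(
      spark: SparkSession,
      sourcePlan: LogicalPlan,            // assumed: the MERGE source plan
      condition: Expression,              // assumed: the ON condition of the MERGE
      incrSourceRowCountExpr: Expression  // assumed: Delta's metric UDF expression
  ): DataFrame = {
    val sourceOnly = splitConjunctivePredicates(condition)
      .filter(_.references.subsetOf(sourcePlan.outputSet))
      .reduceOption(And)
      .getOrElse(Literal.TrueLiteral)
    Dataset.ofRows(spark, sourcePlan)
      .filter(new Column(sourceOnly))             // deterministic: pushed down
      .filter(new Column(incrSourceRowCountExpr)) // non-deterministic: blocks pushdown
  }
}
```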

@LantaoJin (Contributor, Author) commented Jun 17, 2020

Gentle ping @tdas @zsxwing @jose-torres @brkyvz

@LantaoJin (Contributor, Author) commented Jul 16, 2020

Any comments? Gentle ping @tdas @brkyvz @zsxwing @gatorsmile

@GrigorievNick (Contributor) commented

Any reason why this was not merged or at least reviewed?
I checked this branch with my production job and got a big performance and resource usage boost.

@jaceklaskowski (Contributor) left a comment

LGTM

@@ -214,7 +214,7 @@ case class MergeIntoCommand(
   private def isMatchedOnly: Boolean = notMatchedClauses.isEmpty && matchedClauses.nonEmpty

   override lazy val metrics = Map[String, SQLMetric](
-    "numSourceRows" -> createMetric(sc, "number of source rows"),
+    "numSourceRows" -> createMetric(sc, "number of source rows participated in merge"),

nit: s/participated in merge/involved
