Introduce a new conf for vacuum parallel listing #886


Open
wants to merge 18 commits into base: master
Conversation

Contributor
@JassAbidi JassAbidi commented Jan 9, 2022

Currently, vacuum uses spark.sql.sources.parallelPartitionDiscovery.parallelism to control the parallelism of file and directory listing, and the default value of that conf is 10000.
This PR introduces a new Delta SQL conf, DELTA_VACUUM_FILE_LISTING_PARALLELISM, to control the file listing parallelism of the vacuum command. The default value for this conf is 200.

Closes #859

.doc("Sets the number of partitions to use for file listing")
.intConf
.checkValue(_ > 0, "fileListing.parallelism must be positive")
.createWithDefault(200)
Collaborator

Should we set a default here? Or do you think defaulting to spark.sql.shuffle.partitions is more ideal?

Contributor Author

Good point. I even think that defaulting to spark.default.parallelism would be more ideal.

Contributor Author

@rahulsmahadev @vkorukanti any comments on this PR?

@JassAbidi JassAbidi requested a review from rahulsmahadev March 8, 2022 07:51
Member

@zsxwing zsxwing left a comment


To minimize the surprises of this change, can we use the following logic to decide the parallelism?

  • If spark.databricks.delta.vacuum.fileListing.parallelism is set, use it. Otherwise,
  • If spark.sql.sources.parallelPartitionDiscovery.parallelism is set, use it. Otherwise,
  • Use 200.

Then someone who has set spark.sql.sources.parallelPartitionDiscovery.parallelism today can upgrade to the new version without any change in behavior.
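The precedence above can be sketched in plain Scala, with a Map of explicitly set entries standing in for the real SQLConf; the object name and the Map-based lookup are illustrative only, not the actual Delta implementation:

```scala
// Sketch of the proposed precedence, using a plain Map in place of the
// real Spark SQLConf. `conf` holds only entries a user has explicitly set.
object VacuumParallelismSketch {
  val DeltaKey = "spark.databricks.delta.vacuum.fileListing.parallelism"
  val SparkKey = "spark.sql.sources.parallelPartitionDiscovery.parallelism"

  def fileListingParallelism(conf: Map[String, String]): Int =
    conf.get(DeltaKey).map(_.toInt)             // 1. Delta conf, if set
      .orElse(conf.get(SparkKey).map(_.toInt))  // 2. Spark conf, if set
      .getOrElse(200)                           // 3. hard-coded default
}
```

With neither key set this yields 200; setting only the Spark key keeps today's behavior, and the Delta key wins when both are present.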

@JassAbidi JassAbidi requested a review from zsxwing April 3, 2022 10:29
Comment on lines +135 to +139
val fileListingParallelism =
  spark.sessionState.conf.getConf(DeltaSQLConf.DELTA_VACUUM_FILE_LISTING_PARALLELISM)
    .getOrElse(
      Option(spark.sessionState.conf.parallelPartitionDiscoveryParallelism).getOrElse(200)
    )
Member

Could you add unit tests for this config behavior? spark.sessionState.conf.parallelPartitionDiscoveryParallelism will never return null, if I recall correctly.

Contributor Author

spark.sessionState.conf.parallelPartitionDiscoveryParallelism is a Spark conf, and I'm not sure how to check whether it has been set. I could compare the value to its default of 10000, but if the default ever changes in Spark, we would always use Spark's default instead of 200.

Member

Yep, this is a good point. I found that the following code should work:

try {
  spark.getConfString(SQLConf.PARALLEL_PARTITION_DISCOVERY_PARALLELISM.key)
  Some(spark.sessionState.conf.parallelPartitionDiscoveryParallelism)
} catch {
  case _: NoSuchElementException => None
}
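The distinction this try/catch relies on can be modeled in plain Scala. The classes and names below are a simplified stand-in for SQLConf, not its real API: the string accessor throws NoSuchElementException for keys the user never set, while the typed accessor silently falls back to the built-in default.

```scala
// Simplified model of the "is it explicitly set?" check discussed above.
object ExplicitConfSketch {
  class ConfModel(explicit: Map[String, String], defaults: Map[String, Int]) {
    // Throws for keys never explicitly set, like getConfString in the quote.
    def getConfString(key: String): String =
      explicit.getOrElse(key, throw new NoSuchElementException(key))
    // Falls back to the built-in default, like the typed conf accessor.
    def typedValue(key: String): Int =
      explicit.get(key).map(_.toInt).getOrElse(defaults(key))
  }

  def explicitlySet(conf: ConfModel, key: String): Option[Int] =
    try {
      conf.getConfString(key) // throws if the key was never explicitly set
      Some(conf.typedValue(key))
    } catch {
      case _: NoSuchElementException => None
    }
}
```

This returns None when only the default exists, which is exactly what comparing the typed value against 10000 cannot distinguish.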

@@ -253,6 +253,13 @@ trait DeltaSQLConfBase {
.checkValue(_ > 0, "parallelDelete.parallelism must be positive")
.createOptional

val DELTA_VACUUM_FILE_LISTING_PARALLELISM =
buildConf("vacuum.fileListing.parallelism")
.doc("Sets the number of partitions to use for file listing")
Member

Sets the parallelism to use for listing files recursively during a vacuum command. If not set, it defaults to 'spark.sql.sources.parallelPartitionDiscovery.parallelism'. Set this value to prevent file listing from generating too many tasks.

@zsxwing
Member

zsxwing commented Sep 14, 2022

@JassAbidi Could you take a look at my comments when you have time? Thanks!

Successfully merging this pull request may close these issues.

Streaming to Delta Sink, Sharp Increase in Batch Time after ~36h Using Delta-1.0.0