
Optimize vacuum command by adding a minor GC step #887


Open
JassAbidi wants to merge 18 commits into master

Conversation

Contributor
@JassAbidi JassAbidi commented Jan 9, 2022

This PR optimizes the vacuum command by adding a new minor GC step to the existing vacuum algorithm.
The new GC algorithm is composed of two steps:

  • [step1] Minor GC: use the delta snapshot to determine which tracked files should be removed, and delete them. No recursive listing is needed; this can be done in the same pass that extracts the valid paths.
  • [step2] Major/full GC: same as the current behavior.

The major GC benefits from the minor GC because all the tracked files to delete will already have been deleted, and only untracked files/dirs need to be cleaned in this step. Also, directories emptied by files that the current vacuum deletes no longer have to wait until the next vacuum; they are removed during the same GC cycle.
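To make the division of labor concrete, here is a minimal, language-neutral sketch of the two steps described above (in Python rather than the project's Scala, with hypothetical names — `minor_gc`, `major_gc`, the dict keys — that are not the actual Delta Lake API):

```python
def minor_gc(tombstones, delete_before_ts):
    """[step1] Minor GC sketch: expired tombstones (RemoveFile actions in the
    snapshot) name the tracked files to delete directly, so no recursive
    directory listing is required."""
    return [t["path"] for t in tombstones if t["delTimestamp"] < delete_before_ts]


def major_gc(all_files_on_disk, valid_paths):
    """[step2] Major/full GC sketch: recursively listed files that the
    snapshot does not reference (untracked files) get deleted; this is the
    current vacuum behavior. After the minor GC, the expired tracked files
    are already gone from disk, so this pass only sees untracked leftovers."""
    return [p for p in all_files_on_disk if p not in valid_paths]


# Example: one expired tombstone is deleted by the minor GC; the major GC
# then removes only files the snapshot does not know about.
tombstones = [
    {"path": "part-0001", "delTimestamp": 1},
    {"path": "part-0002", "delTimestamp": 9},
]
expired = minor_gc(tombstones, delete_before_ts=5)   # ["part-0001"]
leftovers = major_gc(["stray.tmp", "part-0003"], valid_paths={"part-0003"})
```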

Does this PR introduce any user-facing change?
No

How was this patch tested?

  • Using the existing vacuum correctness tests.
  • More scenarios were added to cover behavior changes introduced by this PR.

Option(pathToString(filePath))
}
validFileOpt.toSeq.flatMap { f =>
// paths are relative so provide '/' as the basePath.
Collaborator

I know this was in the original code, but it's confusing to somebody who doesn't already know how getAllSubdirs works. Is there a way for the comment to hint at what was achieved, and why, rather than what was done? (reader can see easily enough that base path is "/").

In most path APIs, "base" refers to a prefix that should eventually be prepended to the path name. But here it refers to a common prefix the path name is expected to already contain.

Contributor Author

// get all parent paths of the file; those paths will be used to find the untracked files and dirs. Could this remove the confusion?

Collaborator

Yeah, something like that would be great

case tombstone: RemoveFile if tombstone.delTimestamp < deleteBeforeTimestamp =>
Nil
getFilePath(tombstone, fs, reservoirBase, relativizeIgnoreError)
.map(f => TrackedFile(null, f))
Collaborator

I don't think this is safe. getFilePath enumerates the parent paths of each file, e.g.

p1=foo/p2=bar/my_file

would return

p1=foo
p1=foo/p2=bar
p1=foo/p2=bar/my_file

For adds this is safe -- a subdirectory referenced by even one kept file must still be non-empty.
For deletes this is SUPER UNSAFE -- we cannot prove a subdirectory is empty, by observing that one removed file no longer references it.

I'm pretty sure minor GC deletes need to just take the file name as-is, and not attempt to enumerate parent paths.
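The enumeration the reviewer describes can be sketched in a few lines (Python stand-in for the Scala helper; `parent_paths` is a hypothetical name, not the real `getFilePath`):

```python
def parent_paths(path):
    """Expand a relative file path into itself plus every parent directory,
    mimicking the enumeration described for getFilePath.

    For ADDs this expansion is safe: one kept file proves each of its parent
    directories is still non-empty. For REMOVEs it proves nothing: the fact
    that one deleted file no longer references a directory does not show the
    directory is empty, which is the reviewer's safety objection."""
    parts = path.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


# Reproduces the reviewer's example:
paths = parent_paths("p1=foo/p2=bar/my_file")
# ["p1=foo", "p1=foo/p2=bar", "p1=foo/p2=bar/my_file"]
```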

Contributor Author

You're right, getFilePath enumerates the parent paths for each file, but only empty paths will be deleted (tryDeleteNonRecursive doesn't delete non-empty dirs), so this should be safe. I wanted the minor GC to be aggressive and remove the empty dirs too, to optimize the major GC (if we don't remove empty paths during the minor GC, they will be listed and deleted during the major GC).
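The property the author is relying on can be demonstrated with a local-filesystem analogue (this uses `os.rmdir`, not the actual Hadoop `tryDeleteNonRecursive`, but the semantics being invoked are the same: a non-recursive delete refuses to remove a non-empty directory):

```python
import os
import tempfile

# A directory that still contains a tracked file: a non-recursive delete
# attempt must fail rather than destroy data.
root = tempfile.mkdtemp()
sub = os.path.join(root, "p1=foo")
os.mkdir(sub)
open(os.path.join(sub, "kept_file"), "w").close()

try:
    os.rmdir(sub)          # non-empty: the call is refused
    deleted_nonempty = True
except OSError:
    deleted_nonempty = False

# Once the last file is gone, the same call succeeds.
os.remove(os.path.join(sub, "kept_file"))
os.rmdir(sub)
```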

Collaborator

Two objections...
(a) Relying on a "delete this" command to not actually delete something is scary IMO.
(b) In the common case where data grow and files get deleted because of OPTIMIZE etc, the aggressive approach will be issuing gazillions of unnecessary directory delete attempts. Aggravated by the duplication of requests pointed out elsewhere.

Contributor Author

Makes sense; I will adjust it to delete only files found in the state.

// abc should be deleted because it became empty after the minor GC deleted file2.txt
CheckFiles(Seq("abc", "abc/file2.txt"), exist = false),
GC(dryRun = false, Seq(reservoirDir.toString)), // nothing should be deleted
CheckFiles(Seq("file1.txt")),
Collaborator

We need much more extensive testing to validate this change. In particular, tests involving partitioned tables with partitions that may or may not become empty during a minor gc.

// paths are relative so provide '/' as the basePath.
Seq(f).flatMap { file =>
val dirs = getAllSubdirs("/", file, fs)
dirs ++ Iterator(file)
Collaborator

Again, I know the original code had this, but this code will produce many duplicate parent paths, one for each file that was deleted from a given directory. Given the extreme latency of delete calls on some cloud storage platforms, it seems like we would want to de-duplicate the file list before making API calls?
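The de-duplication the reviewer asks for is cheap to sketch: collect each deleted file's parent directories into a set before issuing any delete calls, so a directory that lost N files is probed once rather than N times. (Python stand-in with a hypothetical name; the real code operates on Hadoop paths in Scala.)

```python
def dirs_to_probe(deleted_files):
    """Collapse the parent directories of all deleted files into a unique,
    deepest-first list of directory delete attempts.

    Without the set, every file deleted from a directory contributes a
    separate delete attempt for that directory (and for each ancestor),
    which is costly when each attempt is a high-latency cloud storage call.
    Deepest-first ordering lets a child directory be emptied before its
    parent is probed."""
    dirs = set()
    for path in deleted_files:
        parts = path.split("/")[:-1]              # drop the file name
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]))
    return sorted(dirs, key=len, reverse=True)


# Three deleted files in two directories yield only two delete probes:
probes = dirs_to_probe(["p1=a/p2=b/f1", "p1=a/p2=b/f2", "p1=a/f3"])
# ["p1=a/p2=b", "p1=a"]
```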

Contributor Author

If we delete only file paths from the state, this problem should not happen, right? We can also deduplicate the add files before the major GC. This will not impact the delete API calls, but it will probably make the major GC faster (it should make the left-anti join between discovered files and add files faster, though it adds a shuffle to deduplicate the data). What do you think?

Contributor Author
@JassAbidi JassAbidi Jan 11, 2022

I think it may not add a shuffle and the exchange will be reused (the de-duplication and the join use the same key). I will double-check it.

Collaborator

Either way, the API calls are the slowest part of vacuum, and by a wide margin, so the dedup is worth it even if this query does technically slow down a bit.

@JassAbidi JassAbidi changed the title Optimise vacuum command by adding a minor GC step Optimize vacuum command by adding a minor GC step Jan 10, 2022
- deduplicate valid files beforehand to avoid unnecessary delete calls.
- add more test scenarios.
@scottsand-db
Collaborator

@JassAbidi what's the status of this PR? I see that Ryan Johnson last commented on it on Jan 11 2022. Can you please respond to his PR comments?

@ryan-johnson-databricks
Collaborator

@JassAbidi What is the performance impact of this change? How is it tested to verify correctness? The PR description contains no information about either one.

@JassAbidi
Contributor Author

It's tested using the already existing vacuum correctness tests. More scenarios were added to that test to cover situations where the major GC is impacted by the minor GC (a directory emptied by the minor GC should be deleted by the major GC of the same vacuum, and should not wait for the next vacuum to be removed).
I'm not sure how I can test performance; should I build the branch and run it on an EMR cluster, for example, or can I do it in a unit test?
