
Optimize vacuum command by adding a minor GC step #887


Open
JassAbidi wants to merge 18 commits into master

Conversation

Contributor
@JassAbidi JassAbidi commented Jan 9, 2022

This PR optimizes the vacuum command by adding a new minor GC step to the existing vacuum algorithm.
The new GC algorithm is composed of two steps:

  • [step1] Minor GC: use the delta snapshot to determine which tracked files should be removed, and delete them. No recursive listing is needed; this can be done in the same pass that extracts the valid paths.
  • [step2] Major/full GC: same as the current behavior.

The major GC benefits from the minor GC because all the tracked files to delete will already have been deleted, and only untracked files/dirs need to be cleaned in this step. Also, directories emptied by files that the current vacuum deletes no longer have to wait until the next vacuum; they are removed during the same GC cycle.
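To make the division of labor concrete, here is a minimal, language-neutral sketch of the two steps described above (in Python rather than the project's Scala, with hypothetical names — `minor_gc`, `major_gc`, the dict keys — that are not the actual Delta Lake API):

```python
def minor_gc(tombstones, delete_before_ts):
    """[step1] Minor GC sketch: expired tombstones (RemoveFile actions in the
    snapshot) name the tracked files to delete directly, so no recursive
    directory listing is required."""
    return [t["path"] for t in tombstones if t["delTimestamp"] < delete_before_ts]


def major_gc(all_files_on_disk, valid_paths):
    """[step2] Major/full GC sketch: recursively listed files that the
    snapshot does not reference (untracked files) get deleted; this is the
    current vacuum behavior. After the minor GC, the expired tracked files
    are already gone from disk, so this pass only sees untracked leftovers."""
    return [p for p in all_files_on_disk if p not in valid_paths]


# Example: one expired tombstone is deleted by the minor GC; the major GC
# then removes only files the snapshot does not know about.
tombstones = [
    {"path": "part-0001", "delTimestamp": 1},
    {"path": "part-0002", "delTimestamp": 9},
]
expired = minor_gc(tombstones, delete_before_ts=5)   # ["part-0001"]
leftovers = major_gc(["stray.tmp", "part-0003"], valid_paths={"part-0003"})
```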

Does this PR introduce any user-facing change?
No

How was this patch tested?

  • Using the existing vacuum correctness tests.
  • More scenarios were added to cover behavior changes introduced by this PR.

Option(pathToString(filePath))
}
validFileOpt.toSeq.flatMap { f =>
// paths are relative so provide '/' as the basePath.
Collaborator

I know this was in the original code, but it's confusing to somebody who doesn't already know how getAllSubdirs works. Is there a way for the comment to hint at what was achieved, and why, rather than what was done? (reader can see easily enough that base path is "/").

In most path APIs, "base" refers to a prefix that should eventually be prepended to the path name. But here it refers to a common prefix the path name is expected to already contain.

Contributor Author

// get all parent paths of the file; those paths will be used to find the untracked files and dirs. Could this remove the confusion?

Collaborator

Yeah, something like that would be great

case tombstone: RemoveFile if tombstone.delTimestamp < deleteBeforeTimestamp =>
Nil
getFilePath(tombstone, fs, reservoirBase, relativizeIgnoreError)
.map(f => TrackedFile(null, f))
Collaborator

I don't think this is safe. getFilePath enumerates the parent paths of each file, e.g.

p1=foo/p2=bar/my_file

would return

p1=foo
p1=foo/p2=bar
p1=foo/p2=bar/my_file

For adds this is safe -- a subdirectory referenced by even one kept file must still be non-empty.
For deletes this is SUPER UNSAFE -- we cannot prove a subdirectory is empty, by observing that one removed file no longer references it.

I'm pretty sure minor GC deletes need to just take the file name as-is, and not attempt to enumerate parent paths.
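The enumeration the reviewer describes can be sketched in a few lines (Python stand-in for the Scala helper; `parent_paths` is a hypothetical name, not the real `getFilePath`):

```python
def parent_paths(path):
    """Expand a relative file path into itself plus every parent directory,
    mimicking the enumeration described for getFilePath.

    For ADDs this expansion is safe: one kept file proves each of its parent
    directories is still non-empty. For REMOVEs it proves nothing: the fact
    that one deleted file no longer references a directory does not show the
    directory is empty, which is the reviewer's safety objection."""
    parts = path.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


# Reproduces the reviewer's example:
paths = parent_paths("p1=foo/p2=bar/my_file")
# ["p1=foo", "p1=foo/p2=bar", "p1=foo/p2=bar/my_file"]
```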

Contributor Author

You're right, getFilePath enumerates the parent paths for each file, but only empty paths will be deleted (tryDeleteNonRecursive doesn't delete non-empty dirs), so this should be safe. I wanted the minor GC to be aggressive and remove the empty dirs too, to optimize the major GC (if we don't remove empty paths during the minor GC, they will be listed and deleted during the major GC).
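The property the author is relying on can be demonstrated with a local-filesystem analogue (this uses `os.rmdir`, not the actual Hadoop `tryDeleteNonRecursive`, but the semantics being invoked are the same: a non-recursive delete refuses to remove a non-empty directory):

```python
import os
import tempfile

# A directory that still contains a tracked file: a non-recursive delete
# attempt must fail rather than destroy data.
root = tempfile.mkdtemp()
sub = os.path.join(root, "p1=foo")
os.mkdir(sub)
open(os.path.join(sub, "kept_file"), "w").close()

try:
    os.rmdir(sub)          # non-empty: the call is refused
    deleted_nonempty = True
except OSError:
    deleted_nonempty = False

# Once the last file is gone, the same call succeeds.
os.remove(os.path.join(sub, "kept_file"))
os.rmdir(sub)
```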

Collaborator

Two objections...
(a) Relying on a "delete this" command to not actually delete something is scary IMO.
(b) In the common case where data grow and files get deleted because of OPTIMIZE etc, the aggressive approach will be issuing gazillions of unnecessary directory delete attempts. Aggravated by the duplication of requests pointed out elsewhere.

Contributor Author

Makes sense; I will adjust it to delete only files found in the state.

// abc should be deleted because it became empty after the minor GC deleted file2.txt
CheckFiles(Seq("abc", "abc/file2.txt"), exist = false),
GC(dryRun = false, Seq(reservoirDir.toString)), // nothing should be deleted
CheckFiles(Seq("file1.txt")),
Collaborator

We need much more extensive testing to validate this change. In particular, tests involving partitioned tables with partitions that may or may not become empty during a minor gc.

// paths are relative so provide '/' as the basePath.
Seq(f).flatMap { file =>
val dirs = getAllSubdirs("/", file, fs)
dirs ++ Iterator(file)
Collaborator

Again, I know the original code had this, but this code will produce many duplicate parent paths, one for each file that was deleted from a given directory. Given the extreme latency of delete calls on some cloud storage platforms, it seems like we would want to de-duplicate the file list before making API calls?
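The de-duplication the reviewer asks for is cheap to sketch: collect each deleted file's parent directories into a set before issuing any delete calls, so a directory that lost N files is probed once rather than N times. (Python stand-in with a hypothetical name; the real code operates on Hadoop paths in Scala.)

```python
def dirs_to_probe(deleted_files):
    """Collapse the parent directories of all deleted files into a unique,
    deepest-first list of directory delete attempts.

    Without the set, every file deleted from a directory contributes a
    separate delete attempt for that directory (and for each ancestor),
    which is costly when each attempt is a high-latency cloud storage call.
    Deepest-first ordering lets a child directory be emptied before its
    parent is probed."""
    dirs = set()
    for path in deleted_files:
        parts = path.split("/")[:-1]              # drop the file name
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]))
    return sorted(dirs, key=len, reverse=True)


# Three deleted files in two directories yield only two delete probes:
probes = dirs_to_probe(["p1=a/p2=b/f1", "p1=a/p2=b/f2", "p1=a/f3"])
# ["p1=a/p2=b", "p1=a"]
```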

Contributor Author

If we delete only file paths from the state, this problem should not happen, right? We can also deduplicate the add files before the major GC. This will not impact the delete API calls, but it will probably make the major GC faster (it should make the left-anti join between discovered files and add files faster, though it adds a shuffle to deduplicate the data). What do you think?

Contributor Author
@JassAbidi JassAbidi Jan 11, 2022

I think it may not add a shuffle and the exchange will be reused (the de-duplication and the join use the same key). I will double-check it.

Collaborator

Either way, the API calls are the slowest part of vacuum, and by a wide margin, so the dedup is worth it even if this query does technically slow down a bit.

@JassAbidi JassAbidi changed the title Optimise vacuum command by adding a minor GC step Optimize vacuum command by adding a minor GC step Jan 10, 2022
- deduplicate valid files beforehand to avoid unnecessary delete calls.
- add more test scenarios.
@scottsand-db
Collaborator

@JassAbidi what's the status of this PR? I see that Ryan Johnson last commented on it on Jan 11 2022. Can you please respond to his PR comments?

@ryan-johnson-databricks
Collaborator

@JassAbidi What is the performance impact of this change? How is it tested to verify correctness? The PR description contains no information about either one.

@JassAbidi
Contributor Author

It's tested using the already existing vacuum correctness tests. More scenarios were added to that test to cover situations where the major GC is impacted by the minor GC (a directory emptied by the minor GC should be deleted by the major GC of the same vacuum, and should not wait for the next vacuum to be removed).
I'm not sure how I can test performance; should I build the branch and run it on an EMR cluster, for example, or can I do it in a unit test?
