Optimize vacuum command by adding a minor GC step #887
Conversation
Option(pathToString(filePath))
}
validFileOpt.toSeq.flatMap { f =>
  // paths are relative so provide '/' as the basePath.
I know this was in the original code, but it's confusing to somebody who doesn't already know how getAllSubdirs works. Is there a way for the comment to hint at what was achieved, and why, rather than what was done? (The reader can see easily enough that the base path is "/".)
In most path APIs, "base" refers to a prefix that should eventually be prepended to the path name. But here it refers to a common prefix the path name is expected to already contain.
// get all parent paths that file has, those paths will be used to find the untracked files and dirs
could this remove the confusion?
Yeah, something like that would be great
case tombstone: RemoveFile if tombstone.delTimestamp < deleteBeforeTimestamp =>
  Nil
  getFilePath(tombstone, fs, reservoirBase, relativizeIgnoreError)
    .map(f => TrackedFile(null, f))
I don't think this is safe. getFilePath enumerates the parent paths of each file, e.g.
p1=foo/p2=bar/my_file
would return
p1=foo
p1=foo/p2=bar
p1=foo/p2=bar/my_file
For adds this is safe -- a subdirectory referenced by even one kept file must still be non-empty.
For deletes this is SUPER UNSAFE -- we cannot prove a subdirectory is empty, by observing that one removed file no longer references it.
I'm pretty sure minor GC deletes need to just take the file name as-is, and not attempt to enumerate parent paths.
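The asymmetry can be sketched with a toy helper (illustrative only; `enumerate` is a made-up name, not the actual getFilePath): each file expands to itself plus every ancestor directory, so a single tombstone effectively "votes" to delete directories that other live files may still occupy.

```scala
// Sketch, not the Delta code: expand a relative path into itself plus
// every ancestor path, shortest first. For adds this is safe (a kept file
// proves its ancestors are non-empty); for removes it is not (one removed
// file cannot prove its ancestors are empty).
object ParentPaths {
  def enumerate(relativePath: String): Seq[String] = {
    val parts = relativePath.split("/").toSeq
    (1 to parts.length).map(n => parts.take(n).mkString("/"))
  }
}
```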
You're right, getFilePath enumerates the parent paths for each file, but only empty paths will be deleted (tryDeleteNonRecursive doesn't delete non-empty dirs), so this should be safe. I wanted the minor GC to be aggressive and remove the empty dirs too, to optimize the major GC. (If we don't remove empty paths during the minor GC, they will be listed and deleted during the major GC.)
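The safety argument leans on the delete call refusing non-empty directories. A minimal model of that behavior (a hypothetical helper, not the real tryDeleteNonRecursive) is: a non-recursive directory delete succeeds only when no live file path remains underneath it.

```scala
// Toy model of a non-recursive directory delete's precondition.
// `liveFiles` stands in for the file system's remaining file paths.
object NonRecursiveDeleteModel {
  // True iff deleting `dir` non-recursively would succeed,
  // i.e. no live file still sits under that directory.
  def wouldDelete(liveFiles: Set[String], dir: String): Boolean =
    !liveFiles.exists(_.startsWith(dir + "/"))
}
```

Under this model the aggressive approach never deletes data, but every non-empty directory still costs a (failed) delete attempt, which is the cost objection raised below.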
Two objections...
(a) Relying on a "delete this" command to not actually delete something is scary IMO.
(b) In the common case where data grow and files get deleted because of OPTIMIZE etc, the aggressive approach will be issuing gazillions of unnecessary directory delete attempts. Aggravated by the duplication of requests pointed out elsewhere.
Makes sense, I will adjust it to delete only files found in the state.
// abc should be deleted because it became empty after the minor GC deleted file2.txt
CheckFiles(Seq("abc", "abc/file2.txt"), exist = false),
GC(dryRun = false, Seq(reservoirDir.toString)), // nothing should be deleted
CheckFiles(Seq("file1.txt")),
We need much more extensive testing to validate this change. In particular, tests involving partitioned tables with partitions that may or may not become empty during a minor gc.
// paths are relative so provide '/' as the basePath.
Seq(f).flatMap { file =>
  val dirs = getAllSubdirs("/", file, fs)
  dirs ++ Iterator(file)
Again, I know the original code had this, but this code will produce many duplicate parent paths, one for each file that was deleted from a given directory. Given the extreme latency of delete calls on some cloud storage platforms, it seems like we would want to de-duplicate the file list before making API calls?
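The duplication is easy to quantify with a toy model (all names made up for illustration): with N removed files in a single partition directory, the naive expansion emits that directory N times, while a `distinct` pass collapses the list before any delete API call is issued.

```scala
// Sketch of the duplication problem and the proposed de-duplication.
object DedupDemo {
  // Ancestor directories of a relative path, shortest first, excluding the path itself.
  def parents(path: String): Seq[String] = {
    val parts = path.split("/").toSeq
    (1 until parts.length).map(n => parts.take(n).mkString("/"))
  }

  // Naive expansion: every file re-emits its parent dirs, so sibling
  // files in the same directory produce duplicate entries.
  def expanded(files: Seq[String]): Seq[String] =
    files.flatMap(f => parents(f) :+ f)

  // De-duplicated candidate list, built before any delete call is made.
  def deduped(files: Seq[String]): Seq[String] =
    expanded(files).distinct
}
```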
If we delete only file paths from the state, this problem should not happen, right? We can also deduplicate the add files before the major GC. This will not impact the delete API calls, but will probably make the major GC faster (it should make the left-anti join between discovered files and add files faster, though it will add a shuffle to deduplicate the data). What do you think?
I think it may not add a shuffle and the exchange will be reused (the de-duplication and the join use the same key). I will double-check it.
Either way, the API calls are the slowest part of vacuum, and by a wide margin, so the dedup is worth it even if this query does technically slow down a bit.
- deduplicate valid files to avoid unnecessary delete calls.
- add more test scenarios.
@JassAbidi what's the status of this PR? I see that Ryan Johnson last commented on it on Jan 11 2022. Can you please respond to his PR comments?
@JassAbidi What is the performance impact of this change? How is it tested to verify correctness? The PR description contains no information about either one.
It's tested using the already existing vacuum correctness test. More scenarios were added to that test to cover situations where the major GC is impacted by the minor GC (a directory emptied by the minor GC should be deleted by the major GC of the same vacuum, and should not wait for the next vacuum to be removed).
This PR optimizes the vacuum command by adding a new minor GC step to the existing vacuum algorithm.
The new GC algorithm is composed of two steps:
1. Minor GC: delete the files already tracked as removed in the state (expired RemoveFile tombstones), without listing the table directory.
2. Major GC: list the table directory and delete the untracked files and dirs, as today.
The major GC will benefit from the minor GC since all the tracked files to delete will already be deleted, and only untracked files/dirs will need to be cleaned in this step. Also, dirs emptied by files deleted in the current vacuum will not have to wait until the next vacuum to be deleted; they will be vacuumed during the same GC cycle.
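The two steps above can be sketched over an in-memory "file system" (a mutable set of paths). This is an illustration of the described flow, not the Delta implementation; all names are made up.

```scala
// Two-step vacuum sketch: minor GC works purely from the tracked state,
// major GC reconciles a listing against the tracked adds.
object TwoStepGc {
  import scala.collection.mutable

  // Minor GC: delete files the state already tracks as removed
  // (expired tombstones); no directory listing is needed.
  def minorGc(fsFiles: mutable.Set[String], expiredTombstones: Set[String]): Unit =
    expiredTombstones.foreach(fsFiles.remove)

  // Major GC: "list" everything left and delete whatever no AddFile references.
  def majorGc(fsFiles: mutable.Set[String], trackedAdds: Set[String]): Unit = {
    val untracked = fsFiles.diff(trackedAdds)
    untracked.foreach(fsFiles.remove)
  }
}
```

After the minor GC, the major GC's listing no longer contains the tombstoned files, so its untracked-file set (and hence its delete workload) shrinks to genuinely orphaned paths.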
Does this PR introduce any user-facing change?
No
How was this patch tested?