-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Version vector related extensions to the recovery restart logic #12284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
available tLogs change in such a way that the current in-progress recovery could stall.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
maxEnd >= lastEnd.get()) { | ||
// Are the locked servers that were available in the previous iteration still available? If not, | ||
// restart recovery (as there is a chance that the recovery of the previous iteration would stall). | ||
knownLockedTLogIdsChanged = !isSubset(lastKnownLockedTLogIds, currentKnownLockedTLogIds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could add an event, something like:
if (knownLockedTLogIdsChanged) {
TraceEvent("KnownLockedTLogsChanged")
.detail("Last", describe(lastKnownLockedTLogIds))
.detail("Current", describe(currentKnownLockedTLogIds));
}
static bool isSubset(const std::map<uint8_t, std::vector<uint16_t>>& mapA, | ||
const std::map<uint8_t, std::vector<uint16_t>>& mapB) { | ||
for (const auto& [keyA, valueA] : mapA) { | ||
if (mapB.find(keyA) == mapB.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need to lookup keyA
twice, can do a single lookup.
auto it = mapB.find(keyA);
if (it == mapB.end()) {
return false;
}
const auto& valueB = it->second;
@@ -2657,7 +2686,8 @@ ACTOR Future<Void> TagPartitionedLogSystem::epochEnd(Reference<AsyncVar<Referenc | |||
logSystem->logSystemType = prevState.logSystemType; | |||
logSystem->rejoins = rejoins; | |||
logSystem->lockResults = lockResults; | |||
logSystem->knownLockedTLogIds = knownLockedTLogIds; | |||
logSystem->knownLockedTLogIds = currentKnownLockedTLogIds; | |||
lastKnownLockedTLogIds = std::move(currentKnownLockedTLogIds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to save a future headache, add
currentKnownLockedTLogIds.clear(); // ensures safety if accessed later
if (mapB.find(keyA) == mapB.end()) { | ||
return false; | ||
} | ||
const auto& valueB = mapB.at(keyA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for clarity, and since this is not really a generic function, could rename keyA
to locality
(or log
to be consistent with caller, but I think "log" is misleading).
mapA
could become lastLockedTLogs
and mapB
could be newLockedTLogs
, or something similar.
Version vector/Unicast specific: Restart recovery if the list of available tLogs change in such a way that the current in-progress recovery could stall.
More details: Let "listA" be the list of known locked tLogs of the current in-progress recovery. Suppose the list of known locked tLogs change and "listB" be the current list. If "listA" is not a subset of "listB" then there is a chance that the current in-progress recovery would stall, hence restart recovery (even if the recovery version hasn't changed).
We don't need to restart recovery in "main" in this case because the cursor logic tracks these changes (code:
foundationdb/fdbserver/LogSystemPeekCursor.actor.cpp
Line 1155 in d45a17b
foundationdb/fdbserver/LogSystemPeekCursor.actor.cpp
Line 1120 in d45a17b
foundationdb/fdbserver/LogSystemPeekCursor.actor.cpp
Line 898 in d45a17b
Testing:
Joshua id (with version vector disabled): 20250730-232250-sre-7ba451e103463247 (started).
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)