Support concurrent write for vector memtable #13675
Conversation
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
```cpp
@@ -698,6 +699,10 @@ class MemTable final : public ReadOnlyMemTable {
  if (update_counters.num_range_deletes > 0) {
    num_range_deletes_.fetch_add(update_counters.num_range_deletes,
                                 std::memory_order_relaxed);
  // noop for skip-list memtable
```
This adds a regression to the allow_concurrent_write=false case when uncommented, which I could not explain, since BatchPostProcess is only called for concurrent inserts.
Looks pretty awesome!
memtable/vectorrep.cc (Outdated)
```cpp
}

void delete_vector(void* ptr) {
  std::vector<const char*>* v = static_cast<std::vector<const char*>*>(ptr);
```
Btw, I like `auto* v = static_cast<...*>(...)` when simply assigning the result of a static_cast.
memtable/vectorrep.cc (Outdated)
```cpp
@@ -103,16 +107,32 @@ class VectorRep : public MemTableRep {
  using Bucket = std::vector<const char*>;
  std::shared_ptr<Bucket> bucket_;
  mutable port::RWMutex rwlock_;
  RelaxedAtomic<size_t> bucket_size_;
```
There could be a false-sharing situation where updates to `bucket_size_` hurt the efficiency of the rwlock.
Thanks, updated the counter to be cacheline aligned.
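For illustration, a minimal sketch of cacheline-aligning a counter so relaxed stores to it stay off the cache line holding the neighboring lock; the hard-coded 64-byte line size is an assumption, and real RocksDB code would use the port layer's cache-line constant rather than this:

```cpp
#include <atomic>
#include <cstddef>

// Sketch only: give the counter its own cache line so frequent relaxed
// stores do not ping-pong the line that also holds the rwlock.
struct alignas(64) AlignedCounter {
  std::atomic<size_t> value{0};
};
static_assert(sizeof(AlignedCounter) == 64,
              "counter occupies exactly one (assumed 64-byte) cache line");
```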
memtable/vectorrep.cc (Outdated)
```cpp
}

void VectorRep::InsertConcurrently(KeyHandle handle) {
  void* v = tl_writes_.Get();
```
It makes more sense to me to do the static_cast when assigning v from the void*, rather than at each use of v.
Updated here and other places.
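A minimal sketch of the suggested shape, combining it with the earlier auto* note (the names and the ThreadLocalPtr-style Get()/Reset() calls are taken from the snippets above; this is not the PR's exact code):

```cpp
// Cast once when assigning from the void* slot; later uses need no casts.
auto* buf = static_cast<std::vector<const char*>*>(tl_writes_.Get());
if (buf == nullptr) {
  buf = new std::vector<const char*>();
  tl_writes_.Reset(buf);
}
buf->push_back(static_cast<char*>(handle));
```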
memtable/vectorrep.cc (Outdated)
```cpp
void VectorRep::InsertConcurrently(KeyHandle handle) {
  void* v = tl_writes_.Get();
  if (!v) {
    v = new std::vector<const char*>();
```
In my experience std::deque is faster as an accumulator, because there's no need to copy existing written entries.
That makes sense. However, I don't see a noticeable performance difference yet when running the benchmark with batch sizes of 100 and 1000. I added a TlBucket type alias to make it easier to switch the accumulator type later.
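A sketch of that alias, assuming the TlBucket name from the comment:

```cpp
// Thread-local accumulator type behind one alias; switching to
// std::deque<const char*> later is then a one-line change.
using TlBucket = std::vector<const char*>;
```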
```cpp
    v->clear();
    v->shrink_to_fit();
  }
}
```
Hmm, there could be a small risk of a de facto leak into thread locals here (number of threads that ever inserted into this memtable * sizeof(vector)), but that's probably ok.
Good point, updated here to just free the vector instead. It doesn't show a noticeable performance difference.
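A minimal sketch of the freeing approach described, assuming a ThreadLocalPtr-style Get()/Reset() interface:

```cpp
// Rather than clear() + shrink_to_fit(), which parks an empty vector in
// every thread's slot, delete the buffer outright so nothing accumulates
// per thread that has ever inserted into this memtable.
auto* tl = static_cast<std::vector<const char*>*>(tl_writes_.Get());
delete tl;
tl_writes_.Reset(nullptr);
```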
```cpp
      sizeof(std::remove_reference<decltype(*bucket_)>::type::value_type);
}

void VectorRep::BatchPostProcess() {
```
Hmm. Do you know why one of the WriteBatchInternal::InsertInto() overloads doesn't call inserter.PostProcess()? Seems suspicious.
Yes, that overload takes a WriteGroup and is called for the case when memtable writes are not done in parallel. Let me make it more explicit.
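For context, a rough sketch of what the per-writer flush in VectorRep::BatchPostProcess() amounts to, going by the PR summary; the RelaxedAtomic method name and the WriteLock helper are assumptions here, not the committed code:

```cpp
void VectorRep::BatchPostProcess() {
  // Splice this writer's thread-local buffer into the shared bucket
  // under the write lock, then release the buffer.
  auto* tl = static_cast<std::vector<const char*>*>(tl_writes_.Get());
  if (tl == nullptr) {
    return;
  }
  size_t new_size = 0;
  {
    WriteLock l(&rwlock_);
    bucket_->insert(bucket_->end(), tl->begin(), tl->end());
    new_size = bucket_->size();
  }
  // Published with relaxed ordering so ApproximateMemoryUsage() can read
  // the size without taking rwlock_.
  bucket_size_.StoreRelaxed(new_size);
  delete tl;
  tl_writes_.Reset(nullptr);
}
```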
@cbi42 has updated the pull request. You must reimport the pull request before landing.
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks for the review!
Summary: Some usage of the vector memtable is bottlenecked in the memtable insertion path when using multiple writers. This PR adds support for concurrent writes to the vector memtable. The updates from each concurrent writer are buffered in a thread-local vector; when a writer is done, MemTable::BatchPostProcess() is called to flush the thread-local updates to the main vector. A TSAN test and the function comment suggest that ApproximateMemoryUsage() needs to be thread-safe, so its implementation is updated to provide thread-safe access. Together with unordered_write, benchmarks show much improved insertion throughput.

Pull Request resolved: facebook#13675

Test Plan:
- new unit test
- enabled some coverage of the vector memtable in the stress test
- Performance benchmark: measured memtable insertion performance by running fillrandom 20 times.
  - Comparing this branch and main with one thread and write batch size 100:
    - main: 4896888.950 ops/sec
    - branch: 4923366.350 ops/sec
  - Benchmarking this branch with different numbers of threads, allow_concurrent_memtable_write, and unordered_write. The performance ratio is computed as ops/sec divided by ops/sec at 1 thread with the same options.

allow_concurrent | unordered_write | Threads | ops/sec | Performance Ratio
-- | -- | -- | -- | --
0 | 0 | 1 | 4923367 | 1.0
0 | 0 | 2 | 5215640 | 1.1
0 | 0 | 4 | 5588510 | 1.1
0 | 0 | 8 | 6077525 | 1.2
1 | 0 | 1 | 4919060 | 1.0
1 | 0 | 2 | 5821922 | 1.2
1 | 0 | 4 | 7850395 | 1.6
1 | 0 | 8 | 10516600 | 2.1
1 | 1 | 1 | 5050004 | 1.0
1 | 1 | 2 | 8489834 | 1.7
1 | 1 | 4 | 14439513 | 2.9
1 | 1 | 8 | 21538098 | 4.3

```sh
mkdir -p /tmp/bench_$1
export TEST_TMPDIR=/tmp/bench_$1
memtablerep_value=${6:-vector}
(for I in $(seq 1 $2)
do
  /data/users/changyubi/vscode-root/rocksdb/$1 --benchmarks=fillrandom \
    --seed=1722808058 --write_buffer_size=67108864 \
    --min_write_buffer_number_to_merge=1000 --max_write_buffer_number=1000 \
    --enable_pipelined_write=0 --memtablerep=$memtablerep_value \
    --disable_auto_compactions=1 --disable_wal=1 \
    --avoid_flush_during_shutdown=1 \
    --allow_concurrent_memtable_write=${5:-0} --unordered_write=$4 \
    --batch_size=1 --threads=$3 2>&1 | grep "fillrandom"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }'
```

Reviewed By: pdillinger

Differential Revision: D76641755

Pulled By: cbi42

fbshipit-source-id: c107ba42749855ad4fd1f52491eb93900757542e
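Reading the positional parameters off the benchmark script above: $1 is the db_bench binary, $2 the number of runs, $3 the thread count, $4 unordered_write, $5 allow_concurrent_memtable_write (defaulting to 0), and $6 the memtablerep (defaulting to vector). A hypothetical invocation (the script filename here is made up) would look like:

```sh
# 20 runs: 8 threads, unordered_write=1, allow_concurrent_memtable_write=1,
# vector memtable.
bash bench.sh db_bench 20 8 1 1 vector
```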