Support concurrent write for vector memtable #13675
Conversation
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
```cpp
@@ -698,6 +699,10 @@ class MemTable final : public ReadOnlyMemTable {
  if (update_counters.num_range_deletes > 0) {
    num_range_deletes_.fetch_add(update_counters.num_range_deletes,
                                 std::memory_order_relaxed);
  // noop for skip-list memtable
```
This adds a regression to the allow_concurrent_write=false case when uncommented, which I could not explain, since BatchPostProcess is only called for concurrent inserts.
Looks pretty awesome!
memtable/vectorrep.cc (Outdated)
```cpp
}

void delete_vector(void* ptr) {
  std::vector<const char*>* v = static_cast<std::vector<const char*>*>(ptr);
```
Btw, I like `auto* v = static_cast<...*>(...)` when simply assigning the result of a static_cast.
memtable/vectorrep.cc (Outdated)
```cpp
@@ -103,16 +107,32 @@ class VectorRep : public MemTableRep {
  using Bucket = std::vector<const char*>;
  std::shared_ptr<Bucket> bucket_;
  mutable port::RWMutex rwlock_;
  RelaxedAtomic<size_t> bucket_size_;
```
There could be a false-sharing situation where updates to `bucket_size_` hurt the efficiency of the rwlock.
Thanks, updated the counter to be cacheline aligned.
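For illustration, a minimal sketch of cacheline-aligning a counter so relaxed stores to it stay off the cache line holding the neighboring lock; the hard-coded 64-byte line size is an assumption, and real RocksDB code would use the port layer's cache-line constant rather than this:

```cpp
#include <atomic>
#include <cstddef>

// Sketch only: give the counter its own cache line so frequent relaxed
// stores do not ping-pong the line that also holds the rwlock.
struct alignas(64) AlignedCounter {
  std::atomic<size_t> value{0};
};
static_assert(sizeof(AlignedCounter) == 64,
              "counter occupies exactly one (assumed 64-byte) cache line");
```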
memtable/vectorrep.cc (Outdated)
```cpp
}

void VectorRep::InsertConcurrently(KeyHandle handle) {
  void* v = tl_writes_.Get();
```
It makes more sense to me to do the static_cast when assigning v from the void*, rather than at each use of v.
Updated here and other places.
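A minimal sketch of the suggested shape, combining it with the earlier auto* note (the names and the ThreadLocalPtr-style Get()/Reset() calls are taken from the snippets above; this is not the PR's exact code):

```cpp
// Cast once when assigning from the void* slot; later uses need no casts.
auto* buf = static_cast<std::vector<const char*>*>(tl_writes_.Get());
if (buf == nullptr) {
  buf = new std::vector<const char*>();
  tl_writes_.Reset(buf);
}
buf->push_back(static_cast<char*>(handle));
```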
memtable/vectorrep.cc (Outdated)
```cpp
void VectorRep::InsertConcurrently(KeyHandle handle) {
  void* v = tl_writes_.Get();
  if (!v) {
    v = new std::vector<const char*>();
```
In my experience std::deque is faster as an accumulator, because there's no need to copy existing written entries.
That makes sense. However, I don't see a noticeable performance difference yet when running the benchmark with batch sizes of 100 and 1000. I added a TlBucket type alias to make it easier to switch the accumulator type later.
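A sketch of that alias, assuming the TlBucket name from the comment:

```cpp
// Thread-local accumulator type behind one alias; switching to
// std::deque<const char*> later is then a one-line change.
using TlBucket = std::vector<const char*>;
```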
```cpp
    v->clear();
    v->shrink_to_fit();
  }
}
```
Hmm, there could be a small risk of a de facto leak into thread locals here (number of threads that ever inserted into this memtable * sizeof(vector)), but that's probably ok.
Good point, updated here to just free the vector instead. It doesn't show a noticeable performance difference.
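A minimal sketch of the freeing approach described, assuming a ThreadLocalPtr-style Get()/Reset() interface:

```cpp
// Rather than clear() + shrink_to_fit(), which parks an empty vector in
// every thread's slot, delete the buffer outright so nothing accumulates
// per thread that has ever inserted into this memtable.
auto* tl = static_cast<std::vector<const char*>*>(tl_writes_.Get());
delete tl;
tl_writes_.Reset(nullptr);
```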
```cpp
      sizeof(std::remove_reference<decltype(*bucket_)>::type::value_type);
}

void VectorRep::BatchPostProcess() {
```
Hmm. Do you know why one of the WriteBatchInternal::InsertInto() overloads doesn't call inserter.PostProcess()? Seems suspicious.
Yes, that overload takes a WriteGroup and is called for the case when memtable writes are not done in parallel. Let me make it more explicit.
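For context, a rough sketch of what the per-writer flush in VectorRep::BatchPostProcess() amounts to, going by the PR summary; the RelaxedAtomic method name and the WriteLock helper are assumptions here, not the committed code:

```cpp
void VectorRep::BatchPostProcess() {
  // Splice this writer's thread-local buffer into the shared bucket
  // under the write lock, then release the buffer.
  auto* tl = static_cast<std::vector<const char*>*>(tl_writes_.Get());
  if (tl == nullptr) {
    return;
  }
  size_t new_size = 0;
  {
    WriteLock l(&rwlock_);
    bucket_->insert(bucket_->end(), tl->begin(), tl->end());
    new_size = bucket_->size();
  }
  // Published with relaxed ordering so ApproximateMemoryUsage() can read
  // the size without taking rwlock_.
  bucket_size_.StoreRelaxed(new_size);
  delete tl;
  tl_writes_.Reset(nullptr);
}
```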
@cbi42 has updated the pull request. You must reimport the pull request before landing.
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks for the review!
Summary: Some usage of the vector memtable is bottlenecked in the memtable insertion path when using multiple writers. This PR adds support for concurrent writes to the vector memtable. The updates from each concurrent writer are buffered in a thread-local vector; when a writer is done, MemTable::BatchPostProcess() is called to flush the thread-local updates to the main vector. A TSAN test and the function comment suggest that ApproximateMemoryUsage() needs to be thread-safe, so its implementation is updated to provide thread-safe access. Together with unordered_write, benchmarks show much improved insertion throughput.

Pull Request resolved: facebook#13675

Test Plan:
- new unit test
- enabled some coverage of the vector memtable in the stress test
- Performance benchmark: measured memtable insertion performance by running fillrandom 20 times.
  - Comparing this branch and main with one thread and write batch size 100:
    - main: 4896888.950 ops/sec
    - branch: 4923366.350 ops/sec
  - Benchmarking this branch with different numbers of threads, allow_concurrent_memtable_write, and unordered_write. The performance ratio is computed as ops/sec divided by ops/sec at 1 thread with the same options.

allow_concurrent | unordered_write | Threads | ops/sec | Performance Ratio
-- | -- | -- | -- | --
0 | 0 | 1 | 4923367 | 1.0
0 | 0 | 2 | 5215640 | 1.1
0 | 0 | 4 | 5588510 | 1.1
0 | 0 | 8 | 6077525 | 1.2
1 | 0 | 1 | 4919060 | 1.0
1 | 0 | 2 | 5821922 | 1.2
1 | 0 | 4 | 7850395 | 1.6
1 | 0 | 8 | 10516600 | 2.1
1 | 1 | 1 | 5050004 | 1.0
1 | 1 | 2 | 8489834 | 1.7
1 | 1 | 4 | 14439513 | 2.9
1 | 1 | 8 | 21538098 | 4.3

```sh
mkdir -p /tmp/bench_$1
export TEST_TMPDIR=/tmp/bench_$1
memtablerep_value=${6:-vector}
(for I in $(seq 1 $2)
do
  /data/users/changyubi/vscode-root/rocksdb/$1 --benchmarks=fillrandom \
    --seed=1722808058 --write_buffer_size=67108864 \
    --min_write_buffer_number_to_merge=1000 --max_write_buffer_number=1000 \
    --enable_pipelined_write=0 --memtablerep=$memtablerep_value \
    --disable_auto_compactions=1 --disable_wal=1 \
    --avoid_flush_during_shutdown=1 \
    --allow_concurrent_memtable_write=${5:-0} --unordered_write=$4 \
    --batch_size=1 --threads=$3 2>&1 | grep "fillrandom"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }'
```

Reviewed By: pdillinger

Differential Revision: D76641755

Pulled By: cbi42

fbshipit-source-id: c107ba42749855ad4fd1f52491eb93900757542e
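Reading the positional parameters off the benchmark script above: $1 is the db_bench binary, $2 the number of runs, $3 the thread count, $4 unordered_write, $5 allow_concurrent_memtable_write (defaulting to 0), and $6 the memtablerep (defaulting to vector). A hypothetical invocation (the script filename here is made up) would look like:

```sh
# 20 runs: 8 threads, unordered_write=1, allow_concurrent_memtable_write=1,
# vector memtable.
bash bench.sh db_bench 20 8 1 1 vector
```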