
Support concurrent write for vector memtable #13675


Closed · cbi42 wants to merge 4 commits from concurrent-vector

Conversation

@cbi42 (Member) commented Jun 6, 2025

Summary: Some usages of the vector memtable are bottlenecked on the memtable insertion path when using multiple writers. This PR adds support for concurrent writes to the vector memtable. The updates from each concurrent writer are buffered in a thread-local vector. When a writer is done, MemTable::BatchPostProcess() is called to flush the thread-local updates into the main vector. A TSAN test and the function comment suggest that ApproximateMemoryUsage() needs to be thread-safe, so its implementation is updated to provide thread-safe access.
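To make the scheme concrete, here is a minimal, self-contained sketch of the buffering pattern described above (not the actual RocksDB code: the real VectorRep reaches the buffer through a per-instance ThreadLocalPtr and uses port::RWMutex, while this sketch uses a static thread_local and std::mutex to stay short):

```
#include <mutex>
#include <vector>

class ConcurrentVectorSketch {
 public:
  // Writers append to their own thread-local buffer, no synchronization.
  void InsertConcurrently(const char* entry) { tl_writes_.push_back(entry); }

  // Called once per writer when its batch is done, mirroring the role of
  // MemTable::BatchPostProcess(): flush the thread-local updates into the
  // shared vector under a lock, so writers contend only once per batch.
  void BatchPostProcess() {
    std::lock_guard<std::mutex> guard(mu_);
    bucket_.insert(bucket_.end(), tl_writes_.begin(), tl_writes_.end());
    tl_writes_.clear();
  }

 private:
  static thread_local std::vector<const char*> tl_writes_;
  std::vector<const char*> bucket_;
  std::mutex mu_;
};

thread_local std::vector<const char*> ConcurrentVectorSketch::tl_writes_;
```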

Together with unordered_write, benchmarks show much improved insertion throughput.

Test plan:

  • new unit test
  • enabled some coverage of vector memtable in stress test
  • Performance benchmark: benchmarked memtable insertion performance by running fillrandom 20 times
    • Compare branch and main performance with one thread and write batch size 100:
      • main: 4896888.950 ops/sec
      • branch: 4923366.350 ops/sec
    • Benchmark this branch with different numbers of threads and different settings of allow_concurrent_memtable_write and unordered_write. The performance ratio is the ops/sec at a given thread count divided by the ops/sec at 1 thread with the same options; see the table below.
allow_concurrent | unordered_write | Threads | ops/sec | Performance Ratio
-- | -- | -- | -- | --
0 | 0 | 1 | 4923367 | 1.0
0 | 0 | 2 | 5215640 | 1.1
0 | 0 | 4 | 5588510 | 1.1
0 | 0 | 8 | 6077525 | 1.2
1 | 0 | 1 | 4919060 | 1.0
1 | 0 | 2 | 5821922 | 1.2
1 | 0 | 4 | 7850395 | 1.6
1 | 0 | 8 | 10516600 | 2.1
1 | 1 | 1 | 5050004 | 1.0
1 | 1 | 2 | 8489834 | 1.7
1 | 1 | 4 | 14439513 | 2.9
1 | 1 | 8 | 21538098 | 4.3
```
# Benchmark driver (positional args, inferred from their uses below):
#   $1 = db_bench binary name   $2 = number of runs   $3 = threads
#   $4 = unordered_write        $5 = allow_concurrent_memtable_write (default 0)
#   $6 = memtablerep (default vector)
mkdir -p /tmp/bench_$1
export TEST_TMPDIR=/tmp/bench_$1

memtablerep_value=${6:-vector}

# Run fillrandom $2 times and average field 5 (ops/sec) across the runs.
(for I in $(seq 1 $2)
do
	/data/users/changyubi/vscode-root/rocksdb/$1 --benchmarks=fillrandom --seed=1722808058 --write_buffer_size=67108864 --min_write_buffer_number_to_merge=1000 --max_write_buffer_number=1000 --enable_pipelined_write=0 --memtablerep=$memtablerep_value --disable_auto_compactions=1 --disable_wal=1 --avoid_flush_during_shutdown=1 --allow_concurrent_memtable_write=${5:-0} --unordered_write=$4 --batch_size=1 --threads=$3 2>&1 | grep "fillrandom"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }';
```
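For example, a hypothetical invocation (assuming the script is saved as bench.sh): `bash bench.sh db_bench 20 8 1 1` runs fillrandom 20 times with 8 threads, unordered_write=1, allow_concurrent_memtable_write=1, and the default vector memtable, then prints the average ops/sec.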

@cbi42 cbi42 force-pushed the concurrent-vector branch from a60ba39 to 4b569e4 Compare June 13, 2025 23:13
@cbi42 cbi42 marked this pull request as ready for review June 14, 2025 01:08
@facebook-github-bot (Contributor) commented:

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```
@@ -698,6 +699,10 @@ class MemTable final : public ReadOnlyMemTable {
    if (update_counters.num_range_deletes > 0) {
      num_range_deletes_.fetch_add(update_counters.num_range_deletes,
                                   std::memory_order_relaxed);
    // noop for skip-list memtable
```
cbi42 (Member, Author) commented on this hunk:

This adds a regression to the allow_concurrent_write=false case when uncommented, which I could not explain, since BatchPostProcess is only called for concurrent inserts.

@cbi42 cbi42 requested review from pdillinger and hx235 and removed request for pdillinger June 14, 2025 04:40
@pdillinger (Contributor) left a comment:

Looks pretty awesome!

```
}

void delete_vector(void* ptr) {
  std::vector<const char*>* v = static_cast<std::vector<const char*>*>(ptr);
```
pdillinger (Contributor):

Btw I like `auto* v = static_cast<...*>(...)` when simply assigning the result of a static_cast.
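Illustrating the suggested style (hypothetical function and variable names, not code from the PR):

```
#include <vector>

void Example(void* ptr) {
  // Before: the pointee type is spelled out twice.
  std::vector<const char*>* v1 = static_cast<std::vector<const char*>*>(ptr);
  // After: auto* deduces the type from the cast, so it is spelled once.
  auto* v2 = static_cast<std::vector<const char*>*>(ptr);
  (void)v1;
  (void)v2;
}
```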

```
@@ -103,16 +107,32 @@ class VectorRep : public MemTableRep {
  using Bucket = std::vector<const char*>;
  std::shared_ptr<Bucket> bucket_;
  mutable port::RWMutex rwlock_;
  RelaxedAtomic<size_t> bucket_size_;
```
pdillinger (Contributor):

There could be a false sharing situation where updates to `bucket_size_` hurt the efficiency of the rwlock.

cbi42 (Member, Author):

Thanks, updated the counter to be cacheline aligned.
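A minimal sketch of that fix, assuming a 64-byte cacheline (the actual code uses RocksDB's own RelaxedAtomic type and portability macros): aligning the hot counter to its own cacheline keeps its frequent updates from invalidating the line that holds the reader-writer lock.

```
#include <atomic>
#include <cstddef>

// Occupies its own cacheline, so nothing else (e.g. a lock) shares it.
struct alignas(64) CachelineAlignedCounter {
  std::atomic<size_t> value{0};
};
```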

```
}

void VectorRep::InsertConcurrently(KeyHandle handle) {
  void* v = tl_writes_.Get();
```
pdillinger (Contributor):

It makes more sense to me to put the static_cast before assigning to v from void*, rather than on use of v.

cbi42 (Member, Author):

Updated here and other places.

```
void VectorRep::InsertConcurrently(KeyHandle handle) {
  void* v = tl_writes_.Get();
  if (!v) {
    v = new std::vector<const char*>();
```
pdillinger (Contributor):

In my experience std::deque is faster as an accumulator, because there's no need to copy existing written entries.

cbi42 (Member, Author):

That makes sense. However, I don't see a noticeable performance difference yet when running the benchmark with batch sizes of 100 and 1000. I added a TlBucket type to make it easier to switch the accumulator type later.
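The indirection mentioned above might look like the following alias (the name TlBucket comes from the comment; the exact definition here is an assumption): every use of the thread-local accumulator goes through the alias, so swapping in std::deque later is a one-line change.

```
#include <vector>

// Swap the right-hand side for std::deque<const char*> to change the
// accumulator type everywhere it is used.
using TlBucket = std::vector<const char*>;
```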

```
    v->clear();
    v->shrink_to_fit();
  }
}
```
pdillinger (Contributor):

Hmm, there could be a small risk of a small de facto leak into thread locals here (number of threads that ever inserted into this memtable × sizeof(vector)), but that's probably ok.

cbi42 (Member, Author):

Good point, updated here to just free the vector instead. It doesn't show a noticeable performance difference.
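The two options discussed here, sketched side by side (simplified: the real code reaches the buffer through RocksDB's ThreadLocalPtr rather than a raw pointer reference):

```
#include <vector>

void ReleaseThreadLocalBuffer(std::vector<const char*>*& tl_writes) {
  // Option A (original): keep an empty vector alive in every thread that
  // ever wrote to this memtable.
  //   tl_writes->clear();
  //   tl_writes->shrink_to_fit();

  // Option B (adopted): free the buffer outright, so idle threads hold no
  // per-memtable allocation.
  delete tl_writes;
  tl_writes = nullptr;
}
```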

```
      sizeof(std::remove_reference<decltype(*bucket_)>::type::value_type);
}

void VectorRep::BatchPostProcess() {
```
pdillinger (Contributor):

Hmm. Do you know why one of the `WriteBatchInternal::InsertInto()` overloads doesn't call `inserter.PostProcess()`? Seems suspicious.

cbi42 (Member, Author):

Yes, that overload takes a WriteGroup and is called for the case when memtable writes are not done in parallel. Let me make it more explicit.

@cbi42 cbi42 force-pushed the concurrent-vector branch from 4b569e4 to 6491ead Compare June 18, 2025 06:13
@facebook-github-bot (Contributor) commented:

@cbi42 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor) commented:

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42 (Member, Author) left a comment:

Thanks for the review!


@facebook-github-bot (Contributor) commented:

@cbi42 merged this pull request in c8aafdb.

SoujanyaPonnapalli pushed a commit to tyler-griggs/rocksdb that referenced this pull request Jul 20, 2025
Summary:
Some usages of the vector memtable are bottlenecked on the memtable insertion path when using multiple writers. This PR adds support for concurrent writes to the vector memtable. The updates from each concurrent writer are buffered in a thread-local vector. When a writer is done, MemTable::BatchPostProcess() is called to flush the thread-local updates into the main vector. A TSAN test and the function comment suggest that ApproximateMemoryUsage() needs to be thread-safe, so its implementation is updated to provide thread-safe access.

Together with unordered_write, benchmarks show much improved insertion throughput.

Pull Request resolved: facebook#13675

Test Plan:
- new unit test
- enabled some coverage of vector memtable in stress test
- Performance benchmark: benchmarked memtable insertion performance by running fillrandom 20 times
  - Compare branch and main performance with one thread and write batch size 100:
    - main: 4896888.950 ops/sec
    - branch: 4923366.350 ops/sec
  - Benchmark this branch with different numbers of threads and different settings of allow_concurrent_memtable_write and unordered_write. The performance ratio is the ops/sec at a given thread count divided by the ops/sec at 1 thread with the same options.

allow_concurrent | unordered_write | Threads | ops/sec | Performance Ratio
-- | -- | -- | -- | --
0 | 0 | 1 | 4923367 | 1.0
0 | 0 | 2 | 5215640 | 1.1
0 | 0 | 4 | 5588510 | 1.1
0 | 0 | 8 | 6077525 | 1.2
1 | 0 | 1 | 4919060 | 1.0
1 | 0 | 2 | 5821922 | 1.2
1 | 0 | 4 | 7850395 | 1.6
1 | 0 | 8 | 10516600 | 2.1
1 | 1 | 1 | 5050004 | 1.0
1 | 1 | 2 | 8489834 | 1.7
1 | 1 | 4 | 14439513 | 2.9
1 | 1 | 8 | 21538098 | 4.3

```
mkdir -p /tmp/bench_$1
export TEST_TMPDIR=/tmp/bench_$1

memtablerep_value=${6:-vector}

(for I in $(seq 1 $2)
do
	/data/users/changyubi/vscode-root/rocksdb/$1 --benchmarks=fillrandom --seed=1722808058 --write_buffer_size=67108864 --min_write_buffer_number_to_merge=1000 --max_write_buffer_number=1000 --enable_pipelined_write=0 --memtablerep=$memtablerep_value --disable_auto_compactions=1 --disable_wal=1 --avoid_flush_during_shutdown=1 --allow_concurrent_memtable_write=${5:-0} --unordered_write=$4 --batch_size=1 --threads=$3 2>&1 | grep "fillrandom"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }';
```

Reviewed By: pdillinger

Differential Revision: D76641755

Pulled By: cbi42

fbshipit-source-id: c107ba42749855ad4fd1f52491eb93900757542e