Releases: facebook/rocksdb

RocksDB 8.3.3

8.3.3 (2023-09-01)

Bug Fixes

  • Fix a bug where, if there is an error reading from offset 0 of a file in L1+ and the file is not the first file in the sorted run, data can be lost in compaction and reads/scans can return incorrect results.
  • Fix a bug where an iterator may return incorrect results for DeleteRange() users if there was an error reading from a file.
  • Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests.

RocksDB 8.3.2

8.3.2 (2023-06-14)

Bug Fixes

  • Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (#11538)

8.3.1 (2023-06-07)

Performance Improvements

  • Fixed elevated read QPS during DB::Open() when reading files created prior to #11406, especially when reading many small files (size < 52 MB) during DB::Open() and a partitioned filter or index is used.

8.3.0 (2023-05-19)

New Features

  • Introduced a new option block_protection_bytes_per_key, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287). A configuration sketch follows this list.
  • Added JemallocAllocatorOptions::num_arenas. Setting num_arenas > 1 may mitigate mutex contention in the allocator, particularly in scenarios where block allocations commonly bypass the jemalloc tcache.
  • Improved the operational safety of publishing a DB or its SST files to many hosts by using different block cache hash seeds on different hosts. The exact behavior is controlled by the new option ShardedCacheOptions::hash_seed, which also documents the solved problem in more detail.
  • Introduced a new option CompactionOptionsFIFO::file_temperature_age_thresholds that allows FIFO compaction to compact files to different temperatures based on key age (#11428).
  • Added a new ticker stat, BLOCK_CHECKSUM_MISMATCH_COUNT, to count how many times RocksDB detected a corruption while verifying a block checksum.
  • New statistic rocksdb.file.read.db.open.micros measures the read time of block-based SST tables or blob files during DB open.
  • New ticker statistics for various iterator seek behaviors and related filtering, named *_LEVEL_SEEK_* (#11460).
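The sketch below shows how the block cache hash seed and per key-value block protection might be wired together. It assumes block_protection_bytes_per_key sits on Options/ColumnFamilyOptions (mirroring memtable_protection_bytes_per_key) and that hash_seed is set through LRUCacheOptions; the path and numeric values are arbitrary.

    #include <memory>

    #include <rocksdb/cache.h>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    int main() {
      // Pin the block cache hash seed so every host maps a given block to the
      // same shard/position (ShardedCacheOptions::hash_seed; 42 is arbitrary).
      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 1 << 30;
      cache_opts.hash_seed = 42;
      std::shared_ptr<rocksdb::Cache> cache = cache_opts.MakeSharedCache();

      rocksdb::BlockBasedTableOptions table_opts;
      table_opts.block_cache = cache;

      rocksdb::Options options;
      options.create_if_missing = true;
      // Per key-value integrity protection for blocks held in block cache
      // (assumed field placement; 0 disables, 8 is the strongest setting).
      options.block_protection_bytes_per_key = 8;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_83_demo", &db);
      delete db;
      return s.ok() ? 0 : 1;
    }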

Public API Changes

  • EXPERIMENTAL: Added a new API, DB::ClipColumnFamily, to clip the keys in a CF to a certain range. It physically deletes all keys outside the range, including tombstones. See the sketch after this list.
  • Added MakeSharedCache() construction functions to various cache Options objects, and deprecated the NewWhateverCache() functions that take long parameter lists.
  • Changed the meaning of various Bloom filter stats (prefix vs. whole key), with iterator-related filtering only being tracked in the new *_LEVEL_SEEK_* stats. (#11460)
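A minimal sketch of the new clipping API; the (ColumnFamilyHandle*, begin_key, end_key) signature is assumed from the description above, and the key bounds are placeholders.

    #include <rocksdb/db.h>
    #include <rocksdb/slice.h>

    // Physically remove all keys of `cf` outside ["b", "e"), including
    // tombstones (EXPERIMENTAL as of 8.3.0).
    rocksdb::Status ClipToRange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
      return db->ClipColumnFamily(cf, rocksdb::Slice("b"), rocksdb::Slice("e"));
    }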

Behavior changes

  • For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware, but could hurt performance of the kCRC32c checksum, which is no longer the default, on some "portable" builds. See PR #11419 for details.

Bug Fixes

  • Delete an empty WAL file on DB open if the log number is less than the min log number to keep
  • Delete temp OPTIONS file on DB open if there is a failure to write it out or rename it

Performance Improvements

  • Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch (#11406).

RocksDB 8.1.1

8.1.1 (2023-04-06)

Bug Fixes

  • In the DB::VerifyFileChecksums API, ensure that file system reads of SST files are equal to the readahead_size in ReadOptions, if specified. Previously, each read was 2x the readahead_size. A usage sketch follows.
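For illustration, a call that exercises the fixed path might look like the following; the 4 MiB value is arbitrary.

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    // With the fix, each file-system read issued by VerifyFileChecksums is
    // capped at readahead_size rather than 2x readahead_size.
    rocksdb::Status VerifyWithReadahead(rocksdb::DB* db) {
      rocksdb::ReadOptions ro;
      ro.readahead_size = 4 * 1024 * 1024;  // 4 MiB per read
      return db->VerifyFileChecksums(ro);
    }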

8.1.0 (2023-03-18)

Behavior changes

  • Compaction output file cutting logic now considers range tombstone start keys. For example, the SST partitioner may now receive a PartitionerRequest for range tombstone start keys.
  • If the async_io ReadOption is specified for MultiGet or NewIterator on a platform that doesn't support IO uring, the option is ignored and synchronous IO is used.

Bug Fixes

  • Fixed an issue with backward iteration when user-defined timestamps are enabled in combination with BlobDB.
  • Fixed a couple of cases where a Merge operand encountered during iteration wasn't reflected in the internal_merge_count PerfContext counter.
  • Fixed a bug in CreateColumnFamilyWithImport()/ExportColumnFamily() which did not support range tombstones (#11252).
  • Fixed a bug where a column family excluded from an atomic flush contained unflushed data that should have been included in the atomic flush (i.e., data with seqno less than the max seqno of the atomic flush), leading to potential data loss in the excluded column family when WriteOptions::disableWAL == true (#11148).

New Features

  • Add statistics rocksdb.secondary.cache.filter.hits, rocksdb.secondary.cache.index.hits, and rocksdb.secondary.cache.data.hits.
  • Added a new PerfContext counter, internal_merge_point_lookup_count, which tracks the number of Merge operands applied while serving point lookup queries.
  • Add new statistics rocksdb.table.open.prefetch.tail.read.bytes and rocksdb.table.open.prefetch.tail.{miss|hit}.
  • Add support for SecondaryCache with HyperClockCache (HyperClockCacheOptions inherits the secondary_cache option from ShardedCacheOptions).
  • Add new DB properties rocksdb.cf-write-stall-stats and rocksdb.db-write-stall-stats, and APIs to examine them in a structured way. In particular, users of GetMapProperty() with property kCFWriteStallStats/kDBWriteStallStats can now use the functions in WriteStallStatsMapKeys to find stats in the map. A minimal sketch follows this list.
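A minimal sketch of reading the structured per-CF write stall stats; it simply iterates whatever keys the DB returns rather than relying on specific WriteStallStatsMapKeys helpers.

    #include <iostream>
    #include <map>
    #include <string>

    #include <rocksdb/db.h>

    // Print the structured write stall stats for the default column family of
    // an already-open DB.
    void DumpWriteStallStats(rocksdb::DB* db) {
      std::map<std::string, std::string> stats;
      if (db->GetMapProperty(rocksdb::DB::Properties::kCFWriteStallStats, &stats)) {
        for (const auto& kv : stats) {
          std::cout << kv.first << " = " << kv.second << "\n";
        }
      }
    }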

Public API Changes

  • Changed various functions and features in Cache that are mostly relevant to custom implementations or wrappers. In particular, asynchronous lookup functionality has moved from Lookup() to a new StartAsyncLookup() function.

RocksDB 7.10.2

7.10.2 (2023-02-10)

Bug Fixes

  • Fixed a bug in DB open/recovery from a compressed WAL caused by incorrect handling of certain record fragments with the same offset within a WAL block.

7.10.1 (2023-02-01)

Bug Fixes

  • Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
  • Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
  • Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress()
  • Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.

7.10.0 (2023-01-23)

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
  • Introduce epoch_number and sort L0 files by epoch_number instead of largest_seqno. epoch_number represents the order in which a file was flushed or ingested/imported. A compaction output file is assigned the minimum epoch_number among its input files. For L0, a larger epoch_number indicates a newer L0 file.

Bug Fixes

  • Fixed a regression in iterators where range tombstones after iterate_upper_bound were processed.
  • Fixed a memory leak in MultiGet with the async_io read option, caused by IO errors during table file open.
  • Fixed a bug where multi-level FIFO compaction deletes a file in a non-L0 level even when CompactionOptionsFIFO::max_table_files_size is not exceeded, present since #10348 or 7.8.0.
  • Fixed a bug caused by DB::SyncWAL() affecting track_and_verify_wals_in_manifest. Without the fix, applications may see "open error: Corruption: Missing WAL with log number" while trying to open the DB. The corruption is a false alarm but prevents DB open (#10892).
  • Fixed a BackupEngine bug in which RestoreDBFromLatestBackup would fail if the latest backup was deleted and there is another valid backup available.
  • Fix L0 file misorder corruption caused by ingesting files whose seqnos overlap with memtable entries', by introducing epoch_number. Before the fix, force_consistency_checks=true may catch the corruption before it is exposed to readers, in which case writes returning Status::Corruption would be expected. This also replaces the previous incomplete fix (#5958) to the same corruption with a new and more complete fix.
  • Fixed a bug in LockWAL() leading to re-locking mutex (#11020).
  • Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.
  • Fixed a heap use after free in async scan prefetching when dictionary compression is enabled, in which case a sync read of the compression dictionary gets mixed with async prefetching.
  • Fixed a data race bug where CompactRange() with change_level=true acts on a range overlapping an ongoing file ingestion for level compaction. This could either result in overlapping file ranges at a certain level (caught by force_consistency_checks=true) or potentially the same key appearing with seqno 0 in two different levels (i.e., new data ending up in a lower/older level). The latter is caught by an assertion in debug builds but goes silently and results in reads returning wrong results in release builds. This fix is general, so it also replaces previous fixes to a similar problem for CompactFiles() (#4665), general CompactRange(), and auto compaction (commits 5c64fb6 and 87dfc1d).
  • Fixed a bug in compaction output cutting where small output files were produced because TTL-related file cutting state was not being updated (#11075).

New Features

  • When an SstPartitionerFactory is configured, CompactRange() now automatically selects for compaction any files overlapping a partition boundary that is in the compaction range, even if no actual entries are in the requested compaction range. With this feature, manual compaction can be used to (re-)establish SST partition points when SstPartitioner changes, without a full compaction.
  • Add BackupEngine feature to exclude files from backup that are known to be backed up elsewhere, using CreateBackupOptions::exclude_files_callback. To restore the DB, the excluded files must be provided in alternative backup directories using RestoreOptions::alternate_dirs.

Public API Changes

  • Substantial changes have been made to the Cache class to support internal development goals. Direct use of Cache class members is discouraged and further breaking modifications are expected in the future. SecondaryCache has some related changes and implementations will need to be updated. (Unlike Cache, SecondaryCache is still intended to support user implementations, and disruptive changes will be avoided.) (#10975)
  • Add MergeOperationOutput::op_failure_scope for merge operator users to control the blast radius of merge operator failures. Existing merge operator users do not need to make any change to preserve the old behavior

Performance Improvements

  • Updated xxHash source code, which should improve kXXH3 checksum speed, at least on ARM (#11098).
  • Improved CPU efficiency of DB reads, from block cache access improvements (#10975).

RocksDB 8.0.0

8.0.0 (2023-02-19)

Behavior changes

  • ReadOptions::verify_checksums=false disables checksum verification for more reads of non-CacheEntryRole::kDataBlock blocks.
  • In case of scans with async_io enabled, if posix doesn't support IO uring, a Status::NotSupported error will be returned to the user. Previously that error was swallowed and reads switched to synchronous reads.

Bug Fixes

  • Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
  • Fixed an issue in Get and MultiGet when user-defined timestamps are enabled in combination with BlobDB.
  • Fixed some atypical behaviors for LockWAL() such as allowing concurrent/recursive use and not expecting UnlockWAL() after non-OK result. See API comments.
  • Fixed a feature interaction bug where for blobs GetEntity would expose the blob reference instead of the blob value.
  • Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
  • Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress()
  • Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.
  • Fixed a bug in DB open/recovery from a compressed WAL caused by incorrect handling of certain record fragments with the same offset within a WAL block.

Feature Removal

  • Remove RocksDB Lite.
  • The feature block_cache_compressed is removed. Statistics related to it are removed too.
  • Remove deprecated Env::LoadEnv(). Use Env::CreateFromString() instead.
  • Remove deprecated FileSystem::Load(). Use FileSystem::CreateFromString() instead.
  • Removed the deprecated version of these utility functions and the corresponding Java bindings: LoadOptionsFromFile, LoadLatestOptions, CheckOptionsCompatibility.
  • Remove the FactoryFunc from the LoadObject method from the Customizable helper methods.

Public API Changes

  • Moved rarely-needed Cache class definition to new advanced_cache.h, and added a CacheWrapper class to advanced_cache.h. Minor changes to SimCache API definitions.
  • Completely removed the following deprecated/obsolete statistics: the tickers BLOCK_CACHE_INDEX_BYTES_EVICT, BLOCK_CACHE_FILTER_BYTES_EVICT, BLOOM_FILTER_MICROS, NO_FILE_CLOSES, STALL_L0_SLOWDOWN_MICROS, STALL_MEMTABLE_COMPACTION_MICROS, STALL_L0_NUM_FILES_MICROS, RATE_LIMIT_DELAY_MILLIS, NO_ITERATORS, NUMBER_FILTERED_DELETES, WRITE_TIMEDOUT, BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, BLOCK_CACHE_COMPRESSION_DICT_BYTES_EVICT as well as the histograms STALL_L0_SLOWDOWN_COUNT, STALL_MEMTABLE_COMPACTION_COUNT, STALL_L0_NUM_FILES_COUNT, HARD_RATE_LIMIT_DELAY_COUNT, SOFT_RATE_LIMIT_DELAY_COUNT, BLOB_DB_GC_MICROS, and NUM_DATA_BLOCKS_READ_PER_LEVEL. Note that as a result, the C++ enum values of the still supported statistics have changed. Developers are advised to not rely on the actual numeric values.
  • Deprecated IngestExternalFileOptions::write_global_seqno and changed its default to false. This option only needs to be set to true to generate a DB compatible with RocksDB versions before 5.16.0. See the sketch after this list.
  • Remove deprecated APIs GetColumnFamilyOptionsFrom{Map|String}(const ColumnFamilyOptions&, ..), GetDBOptionsFrom{Map|String}(const DBOptions&, ..), GetBlockBasedTableOptionsFrom{Map|String}(const BlockBasedTableOptions& table_options, ..) and GetPlainTableOptionsFrom{Map|String}(const PlainTableOptions& table_options,..).
  • Added a subcode of Status::Corruption, Status::SubCode::kMergeOperatorFailed, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions
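For illustration, ingesting an external file with the new write_global_seqno default looks like the following; the file path is a placeholder.

    #include <string>
    #include <vector>

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    rocksdb::Status IngestSst(rocksdb::DB* db, const std::string& sst_path) {
      rocksdb::IngestExternalFileOptions ifo;
      // The new default; only set to true for compatibility with RocksDB
      // versions before 5.16.0.
      ifo.write_global_seqno = false;
      return db->IngestExternalFile({sst_path}, ifo);
    }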

Build Changes

  • The make build now builds a shared library by default instead of a static library. Use LIB_MODE=static to override.

New Features

  • Compaction filters are now supported for wide-column entities by means of the FilterV3 API. See the comment of the API for more details.
  • Added do_not_compress_roles to CompressedSecondaryCacheOptions to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.
  • Added a new MultiGetEntity API that enables batched wide-column point lookups. See the API comments for more details, and the sketch below.
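The sketch below assumes a C-array style MultiGetEntity overload mirroring MultiGet; the exact parameter order is an assumption, so check the API comments for the authoritative signature. Keys are placeholders.

    #include <rocksdb/db.h>
    #include <rocksdb/wide_columns.h>

    void LookupEntities(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
      constexpr size_t kNumKeys = 2;
      rocksdb::Slice keys[kNumKeys] = {"user:1", "user:2"};
      rocksdb::PinnableWideColumns results[kNumKeys];
      rocksdb::Status statuses[kNumKeys];

      // Batched wide-column point lookup.
      db->MultiGetEntity(rocksdb::ReadOptions(), cf, kNumKeys, keys, results,
                         statuses);

      for (size_t i = 0; i < kNumKeys; ++i) {
        if (!statuses[i].ok()) continue;
        for (const rocksdb::WideColumn& col : results[i].columns()) {
          // col.name() and col.value() are Slices into the pinned entity.
        }
      }
    }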

RocksDB 7.9.2

7.9.2 (2022-12-21)

Bug Fixes

  • Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.

7.9.1 (2022-12-08)

Bug Fixes

  • Fixed a regression in iterators where range tombstones after iterate_upper_bound were processed.
  • Fixed a memory leak in MultiGet with the async_io read option, caused by IO errors during table file open.

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)

7.9.0 (2022-11-21)

Performance Improvements

  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

Bug Fixes

  • Fix a memory corruption error in scans when async_io is enabled. The corruption happened when an IOError while reading the data left an empty buffer, while another buffer already undergoing an async read was submitted for reading again.
  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Fixed an issue where the READ_NUM_MERGE_OPERANDS ticker was not updated when the base key-value or tombstone was read from an SST file.
  • Fixed a memory safety bug when using a SecondaryCache with block_cache_compressed. block_cache_compressed no longer attempts to use SecondaryCache features.
  • Fixed a regression in scans with async_io. During seek, valid buffers were getting cleared, causing the regression.
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.

New Features

  • Add basic support for user-defined timestamp to Merge (#10819).
  • Add stats for ReadAsync time spent and async read errors.
  • Basic support for the wide-column data model is now available. Wide-column entities can be stored using the PutEntity API, and retrieved using GetEntity and the new columns API of iterator. For compatibility, the classic APIs Get and MultiGet, as well as iterator's value API, return the value of the anonymous default column of wide-column entities; also, GetEntity and iterator's columns return any plain key-values in the form of an entity which only has the anonymous default column. Merge (and GetMergeOperands) currently also apply to the default column; any other columns of entities are unaffected by Merge operations. Note that some features like compaction filters, transactions, user-defined timestamps, and the SST file writer do not yet support wide-column entities; also, there is currently no MultiGet-like API to retrieve multiple entities at once. We plan to gradually close the above gaps and also implement new features like column-level operations (e.g. updating or querying only certain columns of an entity). A minimal sketch of PutEntity/GetEntity follows this list.
  • Marked HyperClockCache as a production-ready alternative to LRUCache for the block cache. HyperClockCache greatly improves hot-path CPU efficiency under high parallel load or high contention, with some documented caveats and limitations. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
  • Add periodic diagnostics to info_log (LOG file) for HyperClockCache block cache if performance is degraded by bad estimated_entry_charge option.
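A minimal sketch of the wide-column APIs named above; column names and values are arbitrary, and the signatures follow the pattern described here, so they should be checked against the API comments.

    #include <rocksdb/db.h>
    #include <rocksdb/wide_columns.h>

    rocksdb::Status StoreAndReadEntity(rocksdb::DB* db,
                                       rocksdb::ColumnFamilyHandle* cf) {
      // Store an entity with two named columns.
      rocksdb::WideColumns columns{{"name", "alice"}, {"city", "oslo"}};
      rocksdb::Status s =
          db->PutEntity(rocksdb::WriteOptions(), cf, "user:1", columns);
      if (!s.ok()) return s;

      // Retrieve the whole entity; a plain key-value would come back as an
      // entity with only the anonymous default column.
      rocksdb::PinnableWideColumns result;
      return db->GetEntity(rocksdb::ReadOptions(), cf, "user:1", &result);
    }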

Public API Changes

  • Marked block_cache_compressed as a deprecated feature. Use SecondaryCache instead.
  • Added a SecondaryCache::InsertSaved() API, with default implementation depending on Insert(). Some implementations might need to add a custom implementation of InsertSaved(). (Details in API comments.)

RocksDB 7.8.3

7.8.3 (2022-11-29)

  • Revert an internal change in 7.8.0 associated with some memory usage churn.

7.8.2 (2022-11-27)

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.

Bug Fixes

  • Fixed a regression in scans with async_io. During seek, valid buffers were getting cleared, causing the regression.
  • Fixed a performance regression in iterators where range tombstones after iterate_upper_bound were processed.

7.8.1 (2022-11-02)

Bug Fixes

  • Fix a memory corruption error in scans when async_io is enabled. The corruption happened when an IOError while reading the data left an empty buffer, while another buffer already undergoing an async read was submitted for reading again.

7.8.0 (2022-10-22)

New Features

  • DeleteRange() now supports user-defined timestamp.
  • Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
  • Tiered Storage: allow data moving up from the last level to the penultimate level if the input level is penultimate level or above.
  • Added DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except returns cached (stale) values in more cases to reduce overhead.
  • FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker picks the SST file with the smallest starting key in the bottom-most non-empty level. Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in a lower level might sometimes contain newer keys than files in an upper level.
  • Added an option ignore_max_compaction_bytes_for_input to ignore the max_compaction_bytes limit when adding files to be compacted from the input level. This should help reduce write amplification. The option is enabled by default.
  • Tiered Storage: allow data moving up from the last level even if it's a last-level-only compaction, as long as the penultimate level is empty.
  • Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through subdirectories and list only files in the GetChildren API.
  • Add option preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached to SSTs in the table property rocksdb.seqno.time.map, which can be parsed by the ldb or sst_dump tools. An options sketch follows this list.
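A small options sketch covering ignore_max_compaction_bytes_for_input and preserve_internal_time_seconds; placement of both fields on Options is assumed and the values are arbitrary.

    #include <rocksdb/options.h>

    rocksdb::Options MakeAgingAwareOptions() {
      rocksdb::Options options;
      // Record seqno-to-time information for roughly the newest hour of data,
      // written to the rocksdb.seqno.time.map table property.
      options.preserve_internal_time_seconds = 3600;
      // Ignore max_compaction_bytes when picking input-level files, to reduce
      // write amplification (enabled by default in this release).
      options.ignore_max_compaction_bytes_for_input = true;
      return options;
    }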

Bug Fixes

  • Fix a bug in the POSIX AbortIO API's use of io_uring_prep_cancel, which expects sqe->addr to match the submitted read request; the wrong parameter was being passed.
  • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
  • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
  • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
  • Fixed a bug causing manual flush with flush_opts.wait=false to stall when database has stopped all writes (#10001).
  • Fixed a bug in iterator refresh that was not freeing up the SuperVersion, which could cause excessive resource pinning (#10770).
  • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
  • Fixed a memory safety bug in experimental HyperClockCache (#10768)
  • Fixed some cases where ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).

Performance Improvements

  • Try to align compaction output file boundaries to the next level's, which can reduce compaction load by more than 10% for the default level compaction. The feature is enabled by default; to disable it, set AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false. As a side effect, it can create SSTs larger than target_file_size (capped at 2x target_file_size) or smaller files.
  • Improve RoundRobin TTL compaction, which now moves the compaction cursor the same way as normal RoundRobin compaction.
  • Fix a small CPU regression caused by making UserComparatorWrapper Customizable, because Customizable itself has a small CPU overhead for initialization.
  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

Behavior Changes

  • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

Public API changes

  • Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang (#9891). DBs written with this new setting can be read by RocksDB 6.27 and newer. A configuration sketch follows this list.
  • Refactor the classes, APIs and data structures for block cache tracing to allow a user-provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user-provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.
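For example, pinning the checksum type explicitly (or reverting to the previous default) is a one-line table option.

    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    rocksdb::Options MakeOptionsWithChecksum() {
      rocksdb::BlockBasedTableOptions table_opts;
      // kXXH3 is the new default; kCRC32c restores the previous default.
      table_opts.checksum = rocksdb::kXXH3;

      rocksdb::Options options;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
      return options;
    }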

RocksDB 7.7.8

7.7.8 (2022-11-27)

Bug Fixes

  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
  • Fixed a regression in iterators where range tombstones after iterate_upper_bound were processed.

7.7.7 (2022-11-15)

Bug Fixes

  • Fixed a regression in scans with async_io. During seek, valid buffers were getting cleared, causing the regression.

7.7.6 (2022-11-03)

Bug Fixes

  • Fix a memory corruption error in scans when async_io is enabled. The corruption happened when an IOError while reading the data left an empty buffer, while another buffer already undergoing an async read was submitted for reading again.

7.7.5 (2022-10-28)

Bug Fixes

  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

7.7.4 (2022-10-28)

Bug Fixes

  • Fixed a case of calling malloc_usable_size on result of operator new[].

RocksDB 7.7.3

7.7.3 (2022-10-11)

Bug Fixes

  • Fixed a memory safety bug in experimental HyperClockCache (#10768)

RocksDB 7.7.2

7.7.2 (2022-10-05)

Bug Fixes

  • Fixed a bug in iterator refresh that was not freeing up the SuperVersion, which could cause excessive resource pinning (#10770).
  • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).

Behavior Changes

  • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

7.7.1 (2022-09-26)

Bug Fixes

  • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
  • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).

7.7.0 (2022-09-18)

Bug Fixes

  • Fixed a hang when an operation such as GetLiveFiles or CreateNewBackup is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.
  • Fixed bug where FlushWAL(true /* sync */) (used by GetLiveFilesStorageInfo(), which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.
  • Fix periodic_task being unable to re-register the same task type, which may cause SetOptions() to fail to update periodic task intervals such as stats_dump_period_sec and stats_persist_period_sec.
  • Fixed a bug in the rocksdb.prefetched.bytes.discarded stat. It was counting the prefetch buffer size, rather than the actual number of bytes discarded from the buffer.
  • Fix a bug where the directory containing CURRENT could be left unsynced after CURRENT is updated to point to the latest MANIFEST, which risks losing the unsynced update to CURRENT.
  • Update rocksdb.multiget.io.batch.size stat in non-async MultiGet as well.
  • Fix a bug in key range overlap checking with concurrent compactions when user-defined timestamp is enabled. User-defined timestamps should be EXCLUDED when checking if two ranges overlap.
  • Fixed a bug where the blob cache prepopulating logic did not consider the secondary cache (see #10603).
  • Fixed the rocksdb.num.sst.read.per.level, rocksdb.num.index.and.filter.blocks.read.per.level and rocksdb.num.level.read.per.multiget stats in the MultiGet coroutines
  • Fix a bug in the POSIX AbortIO API's use of io_uring_prep_cancel, which expects sqe->addr to match the submitted read request; the wrong parameter was being passed.
  • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.

Public API changes

  • Add rocksdb_column_family_handle_get_id and rocksdb_column_family_handle_get_name to the C API to get the ID and name of a column family.
  • Add a new stat rocksdb.async.prefetch.abort.micros to measure time spent waiting for async prefetch reads to abort

Java API Changes

  • Add CompactionPriority.RoundRobin.
  • Revert to using the default metadata charge policy when creating an LRU cache via the Java API.

Behavior Change

  • DBOptions::verify_sst_unique_id_in_manifest is now an on-by-default feature that verifies SST file identity whenever they are opened by a DB, rather than only at DB::Open time.
  • Right now, when the option migration tool (OptionChangeMigration()) migrates to FIFO compaction, it compacts all the data into one single SST file and moves it to L0. This might create a problem for some users: the giant file may soon be deleted to satisfy max_table_files_size, and might cause the DB to be almost empty. We changed the behavior so that the files are cut to be smaller, but these files might not follow the data insertion order. With the change, after the migration, migrated data might not be dropped in insertion order by FIFO compaction.
  • When a block is first found in CompressedSecondaryCache, we just insert a dummy block into the primary cache and don't erase the block from CompressedSecondaryCache. A standalone handle is returned to the caller. Only if the block is found again in CompressedSecondaryCache before the dummy block is evicted do we erase the block from CompressedSecondaryCache and insert it into the primary cache.
  • When a block is first evicted from the primary cache to CompressedSecondaryCache, we just insert a dummy block in CompressedSecondaryCache. Only if it is evicted again before the dummy block is evicted from the cache is it treated as a hot block and inserted into CompressedSecondaryCache.
  • Improved the estimation of memory used by cached blobs by taking into account the size of the object owning the blob value and also the allocator overhead if malloc_usable_size is available (see #10583).
  • Blob values now have their own category in the cache occupancy statistics, as opposed to being lumped into the "Misc" bucket (see #10601).
  • Change the optimize_multiget_for_io experimental ReadOptions flag to default on.

New Features

  • RocksDB does internal auto prefetching if it notices two sequential reads when readahead_size is not specified. A new option, num_file_reads_for_auto_readahead, is added to BlockBasedTableOptions; it indicates after how many sequential reads internal auto prefetching should start (default is 2). See the sketch after this list.
  • Added new perf context counters block_cache_standalone_handle_count, block_cache_real_handle_count, compressed_sec_cache_insert_real_count, compressed_sec_cache_insert_dummy_count, compressed_sec_cache_uncompressed_bytes, and compressed_sec_cache_compressed_bytes.
  • Memory for blobs which are to be inserted into the blob cache is now allocated using the cache's allocator (see #10628 and #10647).
  • HyperClockCache is an experimental, lock-free Cache alternative for block cache that offers much improved CPU efficiency under high parallel load or high contention, with some caveats. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
  • CompressedSecondaryCacheOptions::enable_custom_split_merge is added for enabling the custom split-and-merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
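A combined configuration sketch for two of the features above; the capacities and the threshold are arbitrary, and the NewCompressedSecondaryCache overload taking an options struct is assumed.

    #include <rocksdb/cache.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    rocksdb::Options MakePrefetchAndSecCacheOptions() {
      rocksdb::BlockBasedTableOptions table_opts;
      // Start internal auto prefetching only after 4 sequential reads
      // (the default is 2).
      table_opts.num_file_reads_for_auto_readahead = 4;

      // Compressed secondary cache with custom split/merge so compressed
      // values are chunked to better fit jemalloc bins.
      rocksdb::CompressedSecondaryCacheOptions sec_opts;
      sec_opts.capacity = 256 << 20;
      sec_opts.enable_custom_split_merge = true;

      rocksdb::LRUCacheOptions cache_opts;
      cache_opts.capacity = 1 << 30;
      cache_opts.secondary_cache = rocksdb::NewCompressedSecondaryCache(sec_opts);
      table_opts.block_cache = rocksdb::NewLRUCache(cache_opts);

      rocksdb::Options options;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
      return options;
    }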

Performance Improvements

  • Iterator performance is improved for DeleteRange() users. Internally, the iterator skips to the end of a range tombstone when possible, instead of looping through each key and checking individually whether the key is range deleted.
  • Eliminated some allocations and copies in the blob read path. Also, PinnableSlice now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
  • In case of scans with async_io enabled, a few optimizations have been added to issue more asynchronous requests in parallel in order to avoid synchronous prefetching.
  • DeleteRange() users should see improvement in get/iterator performance from mutable memtable (see #10547).