GitHub user LiuQhahah edited a discussion: Cuckoo Filter Design Proposal
# 1. Introduction

In https://github.com/apache/kvrocks/issues/2534, we plan to support the cuckoo filter. This document outlines the design for implementing a Cuckoo Filter in kvrocks. The Cuckoo Filter is a high-performance, probabilistic data structure used for approximate set membership testing. It allows for fast checks of whether an item is in a set and, unlike Bloom Filters, supports item deletion.

A key requirement for this implementation is to support counting duplicate items (via the CF.COUNT command). To robustly handle scenarios where a single item may be inserted many more times than its two candidate buckets can hold, we have chosen a sharded (or chained) expansion strategy. This approach ensures functional correctness and system stability under all conditions, accepting a trade-off of degraded query performance in favor of feature completeness and robust handling of high-frequency items.

# 2. Core Concepts

* Fingerprint: Instead of storing items themselves, the Cuckoo Filter stores a "fingerprint" of each item, which is a small hash derived from the original item.
* Two Candidate Buckets: Any given item can be stored in one of two possible buckets within a given sub-filter. The locations of these buckets are determined by two hash functions.
* Kicking: When a new item needs to be inserted and both of its candidate buckets are full, the filter evicts an existing fingerprint from one of the buckets to make space. The evicted fingerprint is then re-inserted into its alternate location, which may cause a cascade of evictions. This process continues until every fingerprint finds a home or a maximum number of evictions is reached.
* Expansion (Chaining): If the kicking process fails to find a home for an item after a set number of attempts, the current sub-filter is considered full. An expansion is then triggered, which appends a new, larger sub-filter to the chain to accommodate more data. This design does not use global rehashing.

# 3. Data Structure Design

To efficiently implement the Cuckoo Filter in kvrocks, we will adopt a "Bucket as Key" storage model, combined with a sharded (or chained) expansion strategy. This model maps each bucket of each sub-filter to an individual key-value pair in RocksDB, ensuring high I/O performance and cache efficiency.

## 3.1. Metadata

The metadata stores the overall configuration for the Cuckoo Filter chain. Each Cuckoo Filter instance corresponds to a single metadata key-value pair in RocksDB.

* Key Format: `namespace:cuckoo_filter:{user_key}`
* Value Format: a serialized CuckooFilterMetadata object.

### CuckooFilterMetadata Structure

| Field Name | Data Type | Size (Bytes) | Description |
|---|---|---|---|
| n_filters | uint32_t | 4 | The number of sub-filters in the chain. |
| capacities | uint64_t[] | Variable | An array storing the bucket count of each sub-filter. `capacities[i]` is the number of buckets of the sub-filter at `filter_index = i`. |
| size | uint64_t | 8 | The total number of items stored in the entire filter chain. |
| max_iterations | uint16_t | 2 | The maximum number of kicks (evictions) allowed during an insertion before triggering an expansion. Default: 20 |
| bucket_size | uint8_t | 1 | The number of fingerprints (slots) each bucket can hold. Default: 2 |
| expansion | uint8_t | 1 | The growth factor for the capacity of a new sub-filter relative to the previous one. Default: 1 |

## 3.2. Bucket Data

Each bucket of each sub-filter is stored as a separate key-value pair. This granular approach is the key to performance.

* Key Format: `cf:{namespace_key}:{filter_index}:{bucket_index}`
  * `cf`: a fixed prefix identifying Cuckoo Filter data.
  * `{namespace_key}`: the same user key as used for the metadata.
  * `{filter_index}`: the 0-based index of the sub-filter in the chain.
  * `{bucket_index}`: the 0-based index of the bucket within that sub-filter.
* Value Format: a binary string containing the concatenated fingerprints for that bucket.
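The bucket value format can be made concrete with a small sketch. This is an illustration only (Python for brevity): the 1-byte fingerprint size, the `0x00` empty-slot marker, and the helper names are assumptions for the example, not fixed by the design.

```python
BUCKET_SIZE = 4  # slots per bucket; the `bucket_size` metadata field

def new_bucket() -> bytearray:
    """An empty bucket value: BUCKET_SIZE zero bytes."""
    return bytearray(BUCKET_SIZE)

def bucket_insert(bucket: bytearray, fp: int) -> bool:
    """Place a fingerprint (assumed 1..255) into the first empty slot; False if the bucket is full."""
    for i in range(BUCKET_SIZE):
        if bucket[i] == 0:
            bucket[i] = fp
            return True
    return False

def bucket_count(bucket: bytearray, fp: int) -> int:
    """Count matching fingerprints; duplicates may occupy several slots (per CF.COUNT)."""
    return sum(1 for slot in bucket if slot == fp)

def bucket_delete(bucket: bytearray, fp: int) -> bool:
    """Clear one occurrence of a fingerprint; False if not present (per CF.DELETE)."""
    for i in range(BUCKET_SIZE):
        if bucket[i] == fp:
            bucket[i] = 0
            return True
    return False
```

With `bucket_size = 4` and 1-byte fingerprints, the serialized value is exactly 4 bytes, which is what keeps each bucket entry in RocksDB tiny.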
Each bucket value is therefore very small (e.g., 4 bytes).

## 3.3. Storage Layout in RocksDB

The following diagram illustrates the storage layout for a Cuckoo Filter that has expanded once (`n_filters = 2`). It shows both the metadata entry and the structure of the individual "Bucket as Key" entries.

```
RocksDB Storage (Bucket as Key + Chaining)

Metadata Entry
  Key:   "namespace:cuckoo_filter:my_filter"
  Value: (serialized CuckooFilterMetadata)
    {
      "n_filters": 2,
      "capacities": [1024, 2048],   // Filter 0 has 1024 buckets,
                                    // Filter 1 has 2048 buckets.
      "size": 1500,
      "max_iterations": 20,
      "bucket_size": 4,             // Each bucket holds 4 fingerprints
      "expansion": 2
    }

Example bucket from Filter #0
  Key:   "cf:...:my_filter:0:123"
  Value: (4 bytes of binary data)
    +----------+----------+----------+----------+
    |   fpA    |   fpB    |   fpC    | (empty)  |
    +----------+----------+----------+----------+
      Slot 0     Slot 1     Slot 2     Slot 3
     (1 byte)   (1 byte)   (1 byte)   (1 byte)

Example bucket from Filter #1
  Key:   "cf:...:my_filter:1:456"
  Value: (4 bytes of binary data)
    +----------+----------+----------+----------+
    |   fpX    |   fpY    |   fpX    |   fpZ    |
    +----------+----------+----------+----------+
      Slot 0     Slot 1     Slot 2     Slot 3
```

# 4. Operational Flow

## 4.1. Add Operation (CF.ADD)

1. Read Metadata: Read the metadata for the given user_key to get `n_filters` and the `capacities` array.
2. Target Last Filter: Target the last sub-filter, which has index `n_filters - 1`.
3. Calculate Buckets: Based on the parameters of the last sub-filter (i.e., `capacities[n_filters - 1]`), calculate the two candidate bucket indices, i1 and i2, for the item.
4. Read Buckets: Use MultiGet to read the two corresponding bucket keys: `...:{n_filters-1}:{i1}` and `...:{n_filters-1}:{i2}`.
5. Attempt Insert: Try to place the item's fingerprint into an empty slot in either bucket. If successful, proceed to step 7.
6. Cuckoo Kicking: If both buckets are full, initiate the kicking process within the current sub-filter.
7. On Success: If the insertion is successful, increment `size` in the metadata. Atomically write the updated metadata and all modified bucket data to RocksDB using a single WriteBatch.
8. On Failure: If the kicking process exceeds `max_iterations`, trigger the Expansion flow.

### 4.1.1. Internal Expansion Flow

1. Update Metadata: Calculate the capacity of the new sub-filter based on the `expansion` factor and append it to the `capacities` array.
2. Create New Filter: Write the updated metadata to a WriteBatch and commit it. This logically creates the new, empty sub-filter.
3. Retry Insert: Re-attempt the original ADD operation. It will now naturally target the newly created, empty sub-filter and succeed.

## 4.3. Exists Operation (CF.EXISTS)

1. Read Metadata: Read the metadata to get `n_filters` and `capacities`.
2. Iterate Through Filters: Loop from `filter_index = 0` to `n_filters - 1`.
3. Inside the loop, for each `filter_index`:
   * Calculate the two candidate bucket indices (i1, i2) for the item based on the current sub-filter's capacity (`capacities[filter_index]`).
   * Use MultiGet to read the two bucket keys: `...:{filter_index}:{i1}` and `...:{filter_index}:{i2}`.
   * If the item's fingerprint is found in either bucket, return `True` immediately and terminate the operation.
4. Return Result: If the loop completes without finding the item, return `False`.

## 4.4. Count Operation (CF.COUNT)

This operation is the primary reason for choosing the chained expansion strategy, as it correctly handles high-frequency duplicate items.

1. Read Metadata: Read the metadata to get `n_filters` and `capacities`.
2. Initialize Counter: Set `total_count = 0`.
3. Iterate Through Filters: Loop from `filter_index = 0` to `n_filters - 1`. Inside the loop, for each `filter_index`:
   * Calculate the two candidate bucket indices (i1, i2) for the item based on the current sub-filter's capacity.
   * Use MultiGet to read the two corresponding buckets.
   * In memory, iterate through the contents of both buckets and add the number of matching fingerprints to `total_count`.
4. Return Result: After the loop completes, return the final `total_count`.

## 4.5. Delete Operation (CF.DELETE)

The delete operation is similar to EXISTS but only needs to find and remove the first occurrence of the item.

1. Read the metadata and iterate through the filters as in EXISTS.
2. In the loop, upon finding the first matching fingerprint in a bucket:
   * Remove the fingerprint from the bucket's value data.
   * Decrement `size` in the metadata.
   * Atomically write the modified bucket data and the updated metadata using a WriteBatch.
   * Return `True` immediately and terminate the operation.
3. Return Result: If the loop completes without finding the item, return `False`.

# 5. Disadvantages and Trade-offs

This design is a pragmatic compromise, and it is important to understand its inherent limitations.
* Query Performance Degradation: The most significant trade-off is that the performance of read operations (EXISTS, COUNT, DELETE) degrades linearly as the number of sub-filters (`n_filters`) increases. Each expansion adds another set of lookups to every query.
* Low Utilization of Older Sub-Filters: The "write-to-tail" strategy means that once a sub-filter triggers an expansion, it becomes "frozen": it will never receive new insertions, even if space is freed up by DELETE operations. This can lead to poor overall space utilization, as older sub-filters may become sparse over time while new ones are created.
* Hotspot-Driven Expansion: As identified above, frequent insertion of a single, high-frequency item can rapidly trigger multiple expansions. This creates numerous sub-filters that are largely dedicated to holding that single item, further exacerbating the space-utilization problem. To mitigate this, it is highly recommended to use a larger `bucket_size` (e.g., 4) instead of the default of 2 if this usage pattern is anticipated. A larger bucket size increases the capacity for duplicate items within a single sub-filter, delaying the need for expansion.

# 6. Concurrency Safety

* Atomic Writes: All write operations must be atomic to ensure data consistency. This is achieved by using RocksDB's WriteBatch.
* Locking: To prevent race conditions during a read-modify-write cycle, a lock should be acquired at the application level for the user key. This is especially critical during expansion, to ensure that concurrent operations do not interfere with each other.

# 7. References

1. https://redis.io/docs/latest/develop/data-types/probabilistic/cuckoo-filter/
2. https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

GitHub link: https://github.com/apache/kvrocks/discussions/3079
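As a closing illustration of the candidate-bucket calculation and kicking loop from §4.1, here is a minimal in-memory sketch (Python for brevity). The hash and fingerprint functions are placeholders, and the power-of-two bucket count is an assumption made so that the XOR-based alternate-bucket trick from the referenced cuckoo filter paper (Fan et al., 2014) is an involution; the real implementation would use kvrocks' own hashing and operate on bucket values read via MultiGet.

```python
import random
from hashlib import sha1

BUCKET_SIZE = 2      # `bucket_size` default from the metadata table
MAX_ITERATIONS = 20  # `max_iterations` default

def _hash(data: bytes) -> int:
    # Placeholder hash function for the sketch.
    return int.from_bytes(sha1(data).digest()[:8], "little")

def fingerprint(item: bytes) -> int:
    # 1-byte fingerprint remapped to 1..255, so 0 can mean "empty slot".
    return _hash(b"fp" + item) % 255 + 1

def alt_bucket(i: int, fp: int, num_buckets: int) -> int:
    # Partial-key cuckoo hashing: the alternate bucket is derived from the
    # current bucket and the fingerprint alone, so an evicted fingerprint can
    # be relocated without re-reading the original item. num_buckets must be
    # a power of two for the XOR to map back and forth between the two.
    return i ^ (_hash(bytes([fp])) % num_buckets)

def candidate_buckets(item: bytes, num_buckets: int) -> tuple[int, int]:
    i1 = _hash(item) % num_buckets
    return i1, alt_bucket(i1, fingerprint(item), num_buckets)

def insert(buckets: list[list[int]], item: bytes) -> bool:
    """Steps 3-8 of CF.ADD, restricted to a single in-memory sub-filter."""
    fp = fingerprint(item)
    i1, i2 = candidate_buckets(item, len(buckets))
    for i in (i1, i2):
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)  # step 5: empty slot found
            return True
    i = random.choice((i1, i2))    # step 6: both full, start kicking
    for _ in range(MAX_ITERATIONS):
        victim = random.randrange(BUCKET_SIZE)
        fp, buckets[i][victim] = buckets[i][victim], fp  # evict a resident
        i = alt_bucket(i, fp, len(buckets))              # chase it home
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    return False  # step 8: sub-filter full -> trigger the expansion flow

def exists(buckets: list[list[int]], item: bytes) -> bool:
    i1, i2 = candidate_buckets(item, len(buckets))
    fp = fingerprint(item)
    return fp in buckets[i1] or fp in buckets[i2]
```

Because relocation only ever moves a fingerprint between its two candidate buckets, a lookup never needs more than the two `MultiGet` reads per sub-filter described in §4.3, regardless of how many kicks happened during insertion.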
