GitHub user LiuQhahah added a comment to the discussion: Cuckoo Filter Design Proposal
> I still cannot get your idea.
> If we want 1+2+4=7MB of storage to store buckets, why not to create 7 filters (1MB each)?

Yes, if we want a 7MB filter, the cuckoo filter will create a `filter 0` with a capacity of 7MB. I understand that the issue discussed in https://github.com/apache/kvrocks/discussions/3079#discussioncomment-13918799 is about the **expansion scenario**.

For example, when executing `CF.RESERVE key capacity [BUCKETSIZE bucketsize] [MAXITERATIONS maxiterations] [EXPANSION expansion]` with capacity 1000, bucketsize 2, maxiterations 20, and expansion 2, `filter 0` will be created with 512 buckets. As items are continuously inserted, if a new item still cannot find an empty slot after 20 insertion attempts, the expansion mechanism is triggered and `filter 1` is created. Since expansion is 2, the capacity of `filter 1` becomes 2048. At this point:

- `filter 0 -> bucket0, bucket1, ..., bucket511`
- `filter 1 -> bucket0, bucket1, ..., bucket1023`

> What is the meaning of "sub-filter"s and a more complicated indexing? Why not just filter0, filter1, filter2, ... filter6?

The concept of a sub-filter comes from https://github.com/RedisBloom/RedisBloom/blob/master/src/cuckoo.h#L32C1-L37C1. For the `1 + 2 + 4 = 7 MB` scenario (`expansion=2`), I expect to represent it as `filter0, filter1, filter2` rather than `filter0, filter1, filter2, ..., filter6`, because `expansion=2` instead of `1`.

> It should be a big trouble if we don't have same-size filters.
> We will have a more complicated indexing like 2:1:2:3:0... I really don't think it will work well.
> With more and more rehashing happening, the size of one rocksdb value will become larger and larger. (as you know, rocksdb key-values should be in a relatively small size, except in blob mode)
> We cannot control it in a more flexible way, e.g. we cannot adjust the size of the filter in every rehashing.
I fully understand your concern. Using `the same-size filter` is the case when expansion is 1. For expansion of 2 or greater, in order to avoid the filter growing without bound, we can use a threshold setting.

For a cuckoo filter with an expansion of 1, the space growth during expansion is not dramatic (merely a copy of the size of filter0). However, for an expansion of 2 or greater, the cuckoo filter grows exponentially. We can set a threshold so that once the cuckoo filter has been expanded N times and its capacity exceeds this threshold, we use multiple RocksDB keys to store `filter N`. For instance, if the threshold is set to 1MB and the capacity of the cuckoo filter reaches 2MB after the Nth expansion, we can use two RocksDB keys to store the expanded `filter N`. The index then becomes `cf:{namespace_key}:{filter_index}:{chunk_index}`, which has one more level (`{chunk_index}`) than the original index. This caps the size of any single value stored in RocksDB.

> For example, in the first round, we have one filter with 16 buckets, e.g. filter0 -> bucket0, ... bucket15.
> And after one rehashing, we can just extend it to a filter with 32 buckets, e.g. filter0 -> bucket0 ..., bucket31, since it is still relatively small and we don't need to store it in two keys.
> In your design it seems not possible 🤔

Yes, it is obvious that for small-capacity cuckoo filters, such as one with a capacity of only 1024 and an expansion of 2 (yielding 2048 after expansion), using two RocksDB keys to store the second filter would waste resources.

**In summary**

For expansion=1, the subsequent filter1..filterN remain the same size as filter0 after expansion. For expansion greater than 1, the cuckoo filter grows exponentially, and we control this unbounded growth by setting a threshold.
Once the threshold is exceeded, filterN will be split into multiple chunks and stored in RocksDB. At that point the index becomes `cf:{namespace_key}:{filter_index}:{chunk_index}`, and insertion, deletion, and lookup will be performed against RocksDB based on this index.

GitHub link: https://github.com/apache/kvrocks/discussions/3079#discussioncomment-13920092

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
