GitHub user LiuQhahah added a comment to the discussion: Cuckoo Filter Design 
Proposal

> I still cannot get your idea.
> If we want 1+2+4=7MB of storage to store buckets, why not to create 7 filters 
> (1MB each)?

Yes, if we want a 7MB filter, the cuckoo filter will create a `filter 0` with a 
capacity of 7MB. I understand that the issue discussed in 
https://github.com/apache/kvrocks/discussions/3079#discussioncomment-13918799 
is about the **expansion scenario**. 

For example, when executing `CF.RESERVE key capacity [BUCKETSIZE bucketsize] 
[MAXITERATIONS maxiterations] [EXPANSION expansion]` with `capacity` set to 
1000, `bucketsize` to 2, `maxiterations` to 20, and `expansion` to 2, 
`filter 0` will be created with 512 buckets (1000 / 2 = 500 buckets, rounded 
up to the next power of two). 
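The 512-bucket figure can be reproduced with a small sketch. This is not kvrocks code; the helper names are my own, and the power-of-two rounding mirrors RedisBloom's behavior as an assumption:

```python
# Hedged sketch (illustrative, not kvrocks code): deriving the bucket count
# of `filter 0` from the CF.RESERVE parameters above.

def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def initial_bucket_count(capacity: int, bucket_size: int) -> int:
    # Spread `capacity` items over buckets of `bucket_size` slots,
    # then round up to a power of two for cheap index masking.
    needed = -(-capacity // bucket_size)  # ceil(capacity / bucket_size)
    return next_power_of_two(needed)

print(initial_bucket_count(1000, 2))  # -> 512
```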

When items are inserted continuously and the cuckoo filter fails to find an 
empty slot for a new item after 20 kick-out attempts, the expansion mechanism 
is triggered and `filter 1` is created. Since `expansion` is equal to 2, the 
capacity of `filter 1` becomes 2048. 
At this point, `filter 0 -> bucket0, bucket1, ..., bucket511` and `filter 1 -> 
bucket0, bucket1, ..., bucket1023`.
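The expansion step above can be sketched as follows. This is illustrative Python, not actual kvrocks code; the function name and layout structure are my own, and the numbers follow the example:

```python
# Hedged sketch: each new sub-filter has `expansion` times as many buckets
# as the previous one. Starting from filter 0 (512 buckets, bucketsize 2),
# one expansion with expansion=2 yields filter 1 with 1024 buckets,
# i.e. a capacity of 2048 slots.

def sub_filter_layout(initial_buckets, bucket_size, expansion, n_filters):
    layout = []
    buckets = initial_buckets
    for i in range(n_filters):
        layout.append({"filter": i, "buckets": buckets,
                       "capacity": buckets * bucket_size})
        buckets *= expansion
    return layout

for f in sub_filter_layout(512, 2, 2, 2):
    print(f)
# filter 0: 512 buckets, capacity 1024
# filter 1: 1024 buckets, capacity 2048
```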



> What is the meaning of "sub-filter"s and a more complicated indexing? Why not 
> just filter0, filter1, filter2, ... filter6?

The concept of a sub-filter comes from 
https://github.com/RedisBloom/RedisBloom/blob/master/src/cuckoo.h#L32C1-L37C1 . 
For the `1 + 2 + 4 = 7 MB` scenario (`expansion=2`), I would represent it 
using `filter0, filter1, filter2` rather than `filter0, filter1, 
filter2, ..., filter6`, because `expansion=2` instead of `1`.
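The geometric growth behind the `1 + 2 + 4 = 7 MB` example can be checked with a tiny sketch (illustrative only; the helper name is mine):

```python
# With expansion=2, each sub-filter doubles the previous one, so the total
# storage for n sub-filters is a geometric series.

def total_size_mb(first_mb: int, expansion: int, n_filters: int) -> int:
    return sum(first_mb * expansion ** i for i in range(n_filters))

print(total_size_mb(1, 2, 3))  # 1 + 2 + 4 = 7, so three sub-filters suffice
```

With `expansion=1` the same 7 MB would instead need seven equal 1 MB filters, which is the alternative layout mentioned in the question.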


> It should be a big trouble if we don't have same-size filters. 

> we will have a more complicated indexing like 2:1:2:3:0... I really don't 
> think it will work well.
> with more and more rehashing happen, the size of one rocksdb value will 
> become larger and larger. (as you know, rocksdb key-values should be in a 
> relatively small size. except in blob mode)
> we cannot control it in a more flexible way, e.g. we cannot adjust the size 
> of filter in every rehashing.


I fully understand your concern. 
Same-size filters correspond to the case where the expansion is 1. For an 
expansion of 2 or greater, we can use a threshold to keep any single filter 
from growing without bound.

For a cuckoo filter with an expansion of 1, the space growth during expansion 
is not dramatic: each expansion merely adds another copy the size of 
`filter0`. However, with an expansion of 2 or greater, the cuckoo filter 
grows exponentially. 




We can set a threshold so that once the cuckoo filter has been expanded N 
times and its capacity exceeds this threshold, we use multiple keys to store 
`filter N` in RocksDB.

For instance, if the threshold is set to 1MB and the capacity of the cuckoo 
filter reaches 2MB after the Nth expansion, we can use two RocksDB keys to 
store the expanded `filter N`. The key then becomes 
`cf:{namespace_key}:{filter_index}:{chunk_index}`, which adds one level, 
`{chunk_index}`, to the original `cf:{namespace_key}:{filter_index}` index.
This keeps every RocksDB value within a bounded size while still allowing the 
filter to store more items.
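A minimal sketch of the chunked key layout. The key format is taken from the proposal above; the chunk sizing and the helper name are my assumptions:

```python
# Hedged sketch: once `filter N` exceeds the threshold, split it into
# threshold-sized chunks, each stored under its own RocksDB key
# `cf:{namespace_key}:{filter_index}:{chunk_index}`.

def chunk_keys(namespace_key: str, filter_index: int,
               filter_bytes: int, threshold_bytes: int):
    n_chunks = max(1, -(-filter_bytes // threshold_bytes))  # ceil division
    return [f"cf:{namespace_key}:{filter_index}:{i}" for i in range(n_chunks)]

# A 2 MB filter (filter_index 3 here is arbitrary) with a 1 MB threshold
# is stored under two keys:
print(chunk_keys("mykey", 3, 2 * 1024 * 1024, 1024 * 1024))
# ['cf:mykey:3:0', 'cf:mykey:3:1']
```

A filter below the threshold stays in a single chunk, i.e. one RocksDB key, so small filters pay no extra cost.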



> For example, in the first round, we have one filter with 16 buckets. e.g. 
> filter0 -> bucket0, ... bucket15.

> And after one rehashing, we can just extend it to a filter with 32 buckets, 
> e.g. filter0 -> bucket0 ..., bucket31, since it is still relatively small and 
> we don't need to store it in two keys.
> In your design it seems no possible 🤔

Yes, for small-capacity cuckoo filters, e.g. a capacity of only 1024 with an 
expansion of 2 (2048 after expansion), using two RocksDB keys to store the 
second filter would clearly waste resources. The threshold avoids this: a 
filter is only split once it actually exceeds the threshold.





**In summary** 

For the case of `expansion=1`, after each expansion the subsequent filters 
`filter1` through `filterN` remain the same size as `filter0`.

However, when the expansion is greater than 1, the cuckoo filter's capacity 
grows exponentially. 
In that case, we can bound the growth by setting a threshold. Once a filter 
exceeds the threshold, `filterN` will be split into multiple chunks stored in 
RocksDB, the index becomes `cf:{namespace_key}:{filter_index}:{chunk_index}`, 
and insertion, deletion, and lookup are performed in RocksDB based on this 
index.






GitHub link: 
https://github.com/apache/kvrocks/discussions/3079#discussioncomment-13920092
