Hi,

Thanks for your suggestions @Elias! I had a brief look at the "Cuckoo Filter" and the "Golomb Compressed Sequence". My first impression is that the Golomb Compressed Sequence may not be a good choice, because it seems to require non-constant lookup time, but the Cuckoo Filter looks like a good candidate; I will definitely take a deeper look at it.
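To make the lookup-time argument concrete, here is a rough Java sketch of a Cuckoo Filter. All names and constants in it are my own illustrative assumptions and not part of the current design; the point is only that a key's fingerprint can live in exactly one of two candidate buckets, so contains() probes a constant number of slots, and deletes would also be possible by clearing a fingerprint from one of them.

// Rough sketch of a Cuckoo Filter, only to illustrate the constant-time lookup:
// a key's fingerprint can live in exactly one of two candidate buckets, so
// contains() probes at most 2 * BUCKET_SIZE slots. All names and constants here
// are illustrative assumptions, not part of the proposed Flink design.
import java.util.concurrent.ThreadLocalRandom;

public class CuckooFilterSketch {

    private static final int BUCKET_SIZE = 4;   // slots per bucket (a common choice)
    private static final int MAX_KICKS = 500;   // relocation attempts before giving up

    private final byte[][] buckets;             // fingerprint 0 marks an empty slot
    private final int numBuckets;

    public CuckooFilterSketch(int minBuckets) {
        // Round up to a power of two so the XOR-based alternate index is an involution.
        this.numBuckets = Integer.highestOneBit(Math.max(2, minBuckets) * 2 - 1);
        this.buckets = new byte[numBuckets][BUCKET_SIZE];
    }

    public boolean contains(long key) {
        byte fp = fingerprint(key);
        int i1 = index(key);
        int i2 = altIndex(i1, fp);
        return bucketContains(i1, fp) || bucketContains(i2, fp);   // constant-time probe
    }

    public boolean insert(long key) {
        byte fp = fingerprint(key);
        int i1 = index(key);
        int i2 = altIndex(i1, fp);
        if (placeIn(i1, fp) || placeIn(i2, fp)) {
            return true;
        }
        // Both buckets are full: evict a random fingerprint and relocate it.
        int i = ThreadLocalRandom.current().nextBoolean() ? i1 : i2;
        for (int n = 0; n < MAX_KICKS; n++) {
            int slot = ThreadLocalRandom.current().nextInt(BUCKET_SIZE);
            byte victim = buckets[i][slot];
            buckets[i][slot] = fp;
            fp = victim;
            i = altIndex(i, fp);
            if (placeIn(i, fp)) {
                return true;
            }
        }
        return false;   // the filter is considered full
    }

    private boolean bucketContains(int i, byte fp) {
        for (byte b : buckets[i]) {
            if (b == fp) {
                return true;
            }
        }
        return false;
    }

    private boolean placeIn(int i, byte fp) {
        for (int s = 0; s < BUCKET_SIZE; s++) {
            if (buckets[i][s] == 0) {
                buckets[i][s] = fp;
                return true;
            }
        }
        return false;
    }

    private byte fingerprint(long key) {
        byte fp = (byte) (Long.hashCode(key * 0x9E3779B97F4A7C15L) & 0xFF);
        return fp == 0 ? 1 : fp;   // reserve 0 for "empty"
    }

    private int index(long key) {
        return Long.hashCode(key) & (numBuckets - 1);
    }

    private int altIndex(int i, byte fp) {
        // Partial-key cuckoo hashing: the alternate bucket depends only on the
        // current bucket and the fingerprint, so it can be recomputed on eviction.
        return (i ^ ((fp & 0xFF) * 0x5bd1e995)) & (numBuckets - 1);
    }
}

Of course a real implementation would also need serialization and careful tuning of the fingerprint size, bucket size and load factor; the sketch is only meant to show the lookup/insert path.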
Besides, to me, all of these filters seem to be "variants" of the Bloom filter (which is the smallest unit for storing data in the current design). The main challenge for introducing BF into Flink is the data skew problem (a common phenomenon in production). Could you maybe also have a look at the solution I posted in the Google doc https://docs.google.com/document/d/17UY5RZ1mq--hPzFx-LfBjCAw_kkoIrI9KHovXWkxNYY/edit?usp=sharing for this problem? It would be nice if you could give us some advice on that.

Best,
Sihua

On 05/24/2018 07:21, Elias Levy <fearsome.lucid...@gmail.com> wrote:

I would suggest you consider alternative data structures: a Cuckoo Filter or a Golomb Compressed Sequence. The GCS data structure was introduced in Cache-, Hash- and Space-Efficient Bloom Filters <http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf> by F. Putze, P. Sanders, and J. Singler. See section 4.

We should discuss which exact implementation of bloom filters is the best fit. @Fabian: There are also implementations of bloom filters that use counting and therefore support deletes, but obviously this comes at the cost of a potentially higher space consumption.

On 23.05.2018 at 11:29, Fabian Hueske <fhue...@gmail.com> wrote:

IMO, such a feature would be very interesting. However, my concern with Bloom Filters is that they are insert-only data structures, i.e., it is not possible to remove keys once they were added. This might render the filter useless over time. In a different thread (see discussion in FLINK-8918 [1]), you mentioned that the Bloom Filters would be growing. If we keep them in memory, how can we prevent them from exceeding memory boundaries over time?
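Regarding the counting variant of the Bloom filter mentioned above (the one that supports deletes at the cost of extra space), a minimal sketch could look like the following. The class name, counter width and hashing scheme are only illustrative assumptions, not the proposed implementation; the point is that every slot becomes a small counter (8 bits here instead of 1), which is what buys the remove() operation.

// Minimal sketch of a counting Bloom filter: each bit of a plain Bloom filter
// becomes a small counter, which makes remove() possible but multiplies the
// space per slot (8 bits instead of 1 here). Hashing and sizing below are
// illustrative assumptions, not part of the proposed design.
public class CountingBloomFilterSketch {

    private final byte[] counters;
    private final int numHashes;

    public CountingBloomFilterSketch(int numCounters, int numHashes) {
        this.counters = new byte[numCounters];
        this.numHashes = numHashes;
    }

    public void add(long key) {
        for (int slot : slots(key)) {
            if (counters[slot] != Byte.MAX_VALUE) {   // saturate instead of overflowing
                counters[slot]++;
            }
        }
    }

    public void remove(long key) {
        // Must only be called for keys that were actually added; removing a key
        // that was never inserted can introduce false negatives.
        for (int slot : slots(key)) {
            if (counters[slot] > 0 && counters[slot] != Byte.MAX_VALUE) {
                counters[slot]--;   // saturated counters are never decremented
            }
        }
    }

    public boolean mightContain(long key) {
        for (int slot : slots(key)) {
            if (counters[slot] == 0) {
                return false;
            }
        }
        return true;   // possibly present; false positives remain possible
    }

    private int[] slots(long key) {
        // Double hashing: derive numHashes slot indexes from two base hashes.
        int h1 = Long.hashCode(key);
        int h2 = Long.hashCode(Long.rotateLeft(key, 31) * 0x9E3779B97F4A7C15L);
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            result[i] = Math.floorMod(h1 + i * h2, counters.length);
        }
        return result;
    }
}

So deletes are possible, but as Elias noted the space per slot grows several times over, which is exactly the trade-off we would need to weigh against the memory-growth concern Fabian raised.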