Re: Bloom Filter for Rocksdb

xiangyu feng Sun, 29 Oct 2023 20:44:49 -0700

Hi Kenan,

I would like to share our analysis of the pros and cons of enabling the
Bloom filter in production.

Pros:
With the Bloom filter enabled, RocksDB.get() can skip data files that
definitely do not contain the requested key, which reduces random disk
reads. How much this helps depends on the operator's state access pattern
and on the disk's random-read performance. If an operator always reads the
latest data, or a fixed portion of the data that stays cached in RocksDB's
BlockCache, the improvement will not be significant. If an operator
accesses keys in RocksDB at random, enabling the Bloom filter will help a
lot.
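
To illustrate the idea of skipping files, here is a minimal toy sketch in
Python. It is not RocksDB's actual filter implementation; the class
ToyBloomFilter, the helpers build_file and files_to_read, and the file names
are all hypothetical and exist only for this example. A filter can answer
"definitely absent" or "possibly present", so a lookup only has to read the
files whose filter cannot rule the key out.

```python
import hashlib


class ToyBloomFilter:
    """Toy Bloom filter for illustration; not RocksDB's implementation."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def may_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def build_file(name, keys, bits_per_key=10, num_hashes=7):
    """Build a (name, filter) pair standing in for one data file."""
    bloom = ToyBloomFilter(bits_per_key * len(keys), num_hashes)
    for k in keys:
        bloom.add(k)
    return name, bloom


def files_to_read(key, files):
    """Return the names of files whose filter cannot rule out the key."""
    return [name for name, bloom in files if bloom.may_contain(key)]


files = [
    build_file("file-a", [f"device-{i}" for i in range(0, 1000)]),
    build_file("file-b", [f"device-{i}" for i in range(1000, 2000)]),
    build_file("file-c", [f"device-{i}" for i in range(2000, 3000)]),
]
candidates = files_to_read("device-1500", files)
print(candidates)
```

The file that really holds the key is always in the result, because a Bloom
filter never gives a false negative. Other files show up only when their
filter returns a false positive.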

Cons:
With the Bloom filter enabled, RocksDB's compaction process adds Bloom
filter information to newly generated SST files. This work runs
asynchronously in the background, so it does not affect RocksDB's read and
write performance, but it does cost extra CPU and disk space.

Trade-offs:
The number of bits per key in the Bloom filter determines its accuracy:
more bits mean fewer false positives, but also more CPU cost when the
filter is generated.
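
To make the accuracy side of this trade-off concrete, here is a small Python
sketch based on the textbook Bloom filter formula. This is an assumption for
illustration: RocksDB's real filters use their own layout, so actual rates
may differ somewhat. The function names are hypothetical. With m/n bits per
key and k hash functions, the false positive rate is approximately
(1 - e^(-k*n/m))^k, and k = (m/n) * ln 2 is the optimal choice.

```python
import math


def optimal_num_hashes(bits_per_key):
    # Textbook optimum k = (m/n) * ln 2, rounded to a whole number of hashes.
    return max(1, round(bits_per_key * math.log(2)))


def false_positive_rate(bits_per_key, num_hashes=None):
    # Classic approximation: p = (1 - e^(-k / bits_per_key))^k.
    if num_hashes is None:
        num_hashes = optimal_num_hashes(bits_per_key)
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


for bits in (5, 10, 15, 20):
    rate = false_positive_rate(bits)
    print(f"{bits:>2} bits/key -> ~{rate * 100:.3f}% false positives")
```

Under this model, 10 bits per key gives a false positive rate a little
under 1%, and each additional bit per key lowers it further at the cost of
a larger filter.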

So in general, if your job has sufficient CPU resources and a random state
access pattern, I would recommend enabling the Bloom filter with more than
10 bits per key.

Hope this helps you.

Regards,
Xiangyu

On Mon, Oct 30, 2023 at 10:41 AM David Anderson <dander...@apache.org> wrote:

> I believe bloom filters are off by default because they add overhead and
> aren't always helpful. For example, in workloads that are write-heavy and
> have few reads, bloom filters aren't worth the overhead.
>
> David
>
> On Fri, Oct 20, 2023 at 11:31 AM Mate Czagany <czmat...@gmail.com> wrote:
>
>> Hi,
>>
>> There have been no reports about setting this configuration causing any
>> issues. I would guess it's off by default because it can increase the
>> memory usage by an unpredictable amount.
>>
>> I would say feel free to enable it, from what you've said I also think
>> that this would improve the performance of your jobs. But make sure to
>> configure your jobs so that they will be able to accommodate the potential
>> memory footprint growth. Also, please read the following resources to
>> learn more about RocksDB's Bloom filter:
>> https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter
>> https://rocksdb.org/blog/2014/09/12/new-bloom-filter-format.html
>>
>> Regards,
>> Mate
>>
>>
>> On Fri, Oct 20, 2023 at 3:50 PM Kenan Kılıçtepe <kkilict...@gmail.com> wrote:
>>
>>> Can someone tell me the exact performance effect of enabling the Bloom
>>> filter? Could enabling it cause unpredictable performance problems?
>>>
>>> I read what it is and how it works, and it makes sense, but I also asked
>>> myself why the default value of state.backend.rocksdb.use-bloom-filter is
>>> false.
>>>
>>> We have a 5-server Flink cluster processing real-time IoT data coming
>>> from 5 million devices, and for a lot of jobs we keep different states
>>> for each device.
>>>
>>> Sometimes we have performance issues, and when I check the flamegraph on
>>> the test server I always see that rocksdb.get() is the bottleneck. I just
>>> want to improve RocksDB performance.
>>>
>>> Thanks
>>>
>>>
