[jira] [Commented] (FLINK-8601) Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization

Sihua Zhou (JIRA) Thu, 08 Feb 2018 00:53:06 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356663#comment-16356663
 ]


Sihua Zhou commented on FLINK-8601:
-----------------------------------

[~fhueske], thanks for you reply. I think this is not a new state primitive, so 
it does't have to be supported by all state backends (Indeed, I think rocksdb 
backend could not support this at all because of the serialization problem), it 
just like "InternelTimeService". And yes, I am already working on this feature 
(Indeed I have implemented this and use it in production base on flink 1.3.2 
...). 

> Introduce LinkedBloomFilterState for Approximate calculation and other 
> situations of performance optimization
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8601
>                 URL: https://issues.apache.org/jira/browse/FLINK-8601
>             Project: Flink
>          Issue Type: New Feature
>          Components: Core, DataStream API
>    Affects Versions: 1.4.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Major
>
> h3. Backgroud
> Bloom filter is useful in many situation, for example:
>  * 1. Approximate calculation: deduplication (eg: UV calculation)
>  * 2. Performance optimization: eg, [runtime filter 
> join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]
> However, based on the current status provided by flink, it is hard to use the 
> bloom filter for the following reasons:
>  * 1. Serialization problem: Bloom filter status can be large (for example: 
> 100M), if implement it based on the RocksDB state, the state data will need 
> to be serialized each time it is queried and updated, and the performance 
> will be very poor.
>  * 2. Data skewed: Data in different key group can be skewed, and the 
> information of data skewed can not be accurately predicted before the program 
> is running. Therefore, it is impossible to determine how much resources bloom 
> filter should allocate. One way to do this is to allocate space needed for 
> the most skewed case, but this can lead to very serious waste of resources.
> h3. Requirement
> Therefore, I introduce the LinkedBloomFilterState for flink, which at least 
> need to meet the following features:
>  * 1. Support for changing Parallelism
>  * 2. Only serialize when necessary: when performing checkpoint
>  * 3. Can deal with data skew problem: users only need to specify a 
> LinkedBloomFilterState with the desired input, fpp, system will allocate 
> resource dynamic.
>  * 4. Do not conflict with other state: user can use KeyedState and 
> OperateState when using bloom filter state.
>  * 5. Support relax ttl (ie: the data survival time at least greater than the 
> specified time)
> Design doc:  [design 
> doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-8601) Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization

Reply via email to