[ https://issues.apache.org/jira/browse/FLINK-11172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723586#comment-16723586 ]
Hequn Cheng commented on FLINK-11172:
-------------------------------------

Hi all, thanks for the discussion and suggestions.

[~fhueske] I found that bounded over also cleans up state if a retention time has been configured. Should we remove the retention logic for it? I created another JIRA issue to address the bounded over problem; see FLINK-11188.

> Remove the max retention time in StreamQueryConfig
> --------------------------------------------------
>
>                 Key: FLINK-11172
>                 URL: https://issues.apache.org/jira/browse/FLINK-11172
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>    Affects Versions: 1.8.0
>            Reporter: Yangze Guo
>            Assignee: Yangze Guo
>            Priority: Major
>
> [Stream Query Config|https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/query_configuration.html] is an important and useful feature for trading off accuracy against resource consumption when a query is executed on unbounded streaming data. This feature was first proposed in [FLINK-6491|https://issues.apache.org/jira/browse/FLINK-6491].
> Initially, *QueryConfig* took two parameters, minIdleStateRetentionTime and maxIdleStateRetentionTime, to avoid registering many timers by allowing some freedom in when state is discarded. However, this approach may cause new data to expire earlier than old data, and thus cause greater accuracy loss in some cases. For example, suppose we have an unbounded keyed stream. We process key *_a_* at _*t0*_ and key _*b*_ at _*t1*_, with *_t0 < t1_*. The state of *_a_* will expire at _*t0+maxIdleStateRetentionTime*_, while the state of _*b*_ will expire at *_t1+maxIdleStateRetentionTime_*. Now another record with key *_a_* arrives at _*t2*_ (*_t1 < t2_*), but _*t2+minIdleStateRetentionTime*_ < _*t0+maxIdleStateRetentionTime*_. The state of key *_a_* will still expire at _*t0+maxIdleStateRetentionTime*_, which is earlier than the state of key _*b*_.
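To make the quoted example concrete, here is a minimal Python sketch (not Flink code). It assumes, as the description implies, that a cleanup timer is only re-registered when the access time plus minIdleStateRetentionTime exceeds the currently scheduled cleanup time. The function names, event times, and retention values are made up for illustration, and `min_only_policy` is just one possible reading of a scheme without maxIdleStateRetentionTime.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (key, access time); hypothetical representation


def min_max_policy(events: List[Event], min_ret: int, max_ret: int) -> Dict[str, int]:
    """Expiry time per key under the min/max scheme described in the issue.

    Assumption: a new cleanup time (t + max_ret) is only scheduled when
    no cleanup is scheduled yet, or when t + min_ret is later than the
    currently scheduled cleanup time.
    """
    expiry: Dict[str, int] = {}
    for key, t in events:
        scheduled = expiry.get(key)
        if scheduled is None or t + min_ret > scheduled:
            expiry[key] = t + max_ret
    return expiry


def min_only_policy(events: List[Event], min_ret: int) -> Dict[str, int]:
    """Expiry time per key if every access resets expiry to t + min_ret.

    Assumption: this is an LRU-style reading of a min-only configuration.
    """
    expiry: Dict[str, int] = {}
    for key, t in events:
        expiry[key] = t + min_ret
    return expiry


# a at t0 = 0, b at t1 = 10, a again at t2 = 20 (illustrative values)
events = [("a", 0), ("b", 10), ("a", 20)]
MIN_RET, MAX_RET = 60, 100

current = min_max_policy(events, MIN_RET, MAX_RET)
proposed = min_only_policy(events, MIN_RET)

print(current)   # {'a': 100, 'b': 110}
print(proposed)  # {'a': 80, 'b': 70}
```

With these values, t2 + min (80) is below t0 + max (100), so under the min/max scheme key a keeps its old cleanup time and expires before key b even though it was accessed more recently. Under the min-only reading, a outlives b.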
> According to the [LRU|https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)] guideline, the elements that have been most heavily used in the past few instructions are most likely to be used heavily in the next few instructions too. The state with key _*a*_ should therefore live longer than the state with key _*b*_. The current approach goes against this idea.
> I think we now have a good chance to remove the maxIdleStateRetentionTime argument from *StreamQueryConfig*. Below are my reasons.
> * [FLINK-9423|https://issues.apache.org/jira/browse/FLINK-9423] implemented efficient deletes for the heap-based timer service. We can leverage the delete operation to mitigate excessive timer registration.
> * The current approach can cause new data to expire earlier than old data, and thus cause greater accuracy loss in some cases. Users need to fine-tune the two parameters to avoid this scenario. Directly following the idea of LRU looks like a better solution.
> So I plan to remove maxIdleStateRetentionTime and make the expiration time depend only on _*minIdleStateRetentionTime*_.
> cc [~sunjincheng121], [~fhueske]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)