Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

Dcosta, Agnelo (HBO) Wed, 27 Jan 2021 23:08:45 -0800

Hi Dawid,
Thanks for the tip on time constraint. We are using within in our 
MATCH_RECOGNIZE clause. It set to 3 minutes.
Increase in checkpoint size problem still persists.


Thanks for adding comments to FLINK-15160. I will take a look at changes you 
suggested.

P.S. :
I initially meant to ask what is the difference between
table.exec.state.ttl 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-state-ttl

And
setIdleStateRetentionTime: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time
And if table.exec.state.ttl makes any difference to match_recognize state ?

From: Dawid Wysakowicz <dwysakow...@apache.org>
Date: Wednesday, January 27, 2021 at 12:41 AM
To: Dcosta, Agnelo (HBO) <agnelo.dco...@hbo.com>, user@flink.apache.org 
<user@flink.apache.org>
Subject: Re: Difference between table.exec.source.idle-timeout and 
setIdleStateRetentionTime ?
**External Email received from: dwysakow...@apache.org **


Hey,

As for the MATCH_RECOGNIZE clause, I highly recommend applying a time 
constraint[1]. The idle state retention time does not apply to the 
MATCH_RECOGNIZE, but you can think of the time constraint as something similar, 
but it is closer to the actual query logic.

If you are hitting FLINK-15160 unfortunately I don't have a good solution for 
it. The only thing that comes to my mind is adding a heartbeat event to the 
event stream to prune the partial matches, but I understand it is quite 
invasive.

If you would be willing to help fixing the problem in FLINK, I could also help 
review it and give pointers how it could be done.

Best,

Dawid

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/streaming/match_recognize.html#time-constraint
On 26/01/2021 17:39, Dcosta, Agnelo (HBO) wrote:
Hi Dawid, thanks for the clarification and it helps a lot.
Reply to couple of points :

what is causing the state to grow?
We are using flink SQL and have 5 pattern match queries , 3 group by tumble 
windows. State growth over time is primarily coming from pattern match queries.

Is it ever growing keyspace?
Yes. By design our keyspace is ever growing. The expectation is that messages 
for one key will come in for couple of hours, then stop coming in. We would 
never see messages from that key again. New keys are constantly coming in.

Is it that a watermark does not progress?
Watermark on the subtask level is constantly updating and is in sync with other 
subtasks. We have not seen any issues with watermark updating as such.

Looking through mailing list archive, our problem seems similar to
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Memory-in-CEP-queries-keep-increasing-td31045.html
https://issues.apache.org/jira/browse/FLINK-15160 : Clean up is not applied if 
there are no incoming events for a key.

By design we can have partial matched states/matches in pattern match queries. 
And key space is such that no new event comes in for those partial matches.

thanks.

From: Dawid Wysakowicz <dwysakow...@apache.org><mailto:dwysakow...@apache.org>
Date: Tuesday, January 26, 2021 at 3:14 AM
To: Dcosta, Agnelo (HBO) <agnelo.dco...@hbo.com><mailto:agnelo.dco...@hbo.com>, 
user@flink.apache.org<mailto:user@flink.apache.org> 
<user@flink.apache.org><mailto:user@flink.apache.org>
Subject: Re: Difference between table.exec.source.idle-timeout and 
setIdleStateRetentionTime ?
**External Email received from: 
dwysakow...@apache.org<mailto:dwysakow...@apache.org> **


Hi,

The difference is that the table.exec.source.idle-timeout is used for dealing 
with source idleness[1]. It is a problem that a watermark cannot advance if 
some of the partition become idle (do not produce any records). Watermark is 
always the minimum of watermarks of all input partitions. The setting makes 
flink ignore certain partitions in the calculation after the time threshold is 
reached.

The IdleStateRetention is Table API specific. As described in the link you 
provided it removes entries from a state for keys that were not seen for a 
given time threshold.

As for your issue, I'd recommend first investigating what is causing the state 
to grow. Is it ever growing keyspace? Is it that a watermark does not progress 
(this should manifest in results as well). Or is it something else.

Best,

Dawid



[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#dealing-with-idle-sources
On 25/01/2021 20:12, Dcosta, Agnelo (HBO) wrote:

Hi,

What is the difference between table.exec.source.idle-timeout and 
setIdleStateRetentionTime ?

table.exec.source.idle-timeout: 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-source-idle-timeout



setIdleStateRetentionTime: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time



Some context:
Hi we are using flink 1.12.
Our checkpoint size is constantly increasing once app is deployed.
After performing a restart, checkpoint size goes back to expected size.
Looking at actual checkpoint files generated, it seems our app is holding on to 
state/events since the time app started up.
Based on our sql, the maximum time we would need to hold state is 10 minutes.

This e-mail is intended only for the use of the addressees. Any copying, 
forwarding, printing or other use of this e-mail by persons other than the 
addressees is not authorized. This e-mail may contain information that is 
privileged, confidential and exempt from disclosure. If you are not the 
intended recipient, please notify us immediately by return e-mail (including 
the original message in your reply) and then delete and discard all copies of 
the e-mail. Thank you.
HB75

Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

Reply via email to