[1]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/#state-backend-rocksdb-metrics-estimate-num-keys
From: Mason Chen
Sent: Tuesday, 11 January 2022 19:20
To: Piotr Nowojski
Cc: Mason Chen; user
Subject: Re: unaligned checkpoint for job with large start delay
Hi Piotrek,
No worries—I hope you had a good break.
> Counting how many windows have been registered/fired and plotting that over
> time.
It’s straightforward to count windows that are fired (the trigger exposes the
run time context and we can collect the information in that code path).
Howeve
Hi Mason,
Sorry for the late reply, I was out of office.
I think you could confirm it with more custom metrics, for example by
counting how many windows have been registered/fired and plotting that over
time.
I think it would be more helpful in this case to check how long a task has
been blocked being "busy" processin
Hi Piotrek,
> In other words, something (presumably a watermark) has fired more than
> 151 200 windows at once, which is taking ~1h 10min to process, and during
> this time the checkpoint cannot make any progress. Is this number of
> triggered windows plausible in your scenario?
It seems p
Hi Mason,
Have you already observed those checkpoint timeouts (30 minutes) with the
alignment timeout set to 0ms, or only while you were still running with the
1s alignment timeout?
If the latter, it might be because unaligned checkpoints are failing to
kick in in the first place. Setting the timeout
Hi Piotr,
Thanks for the link to the JIRA ticket, we actually don’t see much state size
overhead between checkpoints in aligned vs unaligned, so we will go with your
recommendation of using unaligned checkpoints with 0s alignment timeout.
For context, we are testing unaligned checkpoints with o
Hi Mason,
In Flink 1.14 we have also changed the timeout behavior from checking
against the alignment duration to simply checking how old the checkpoint
barrier is (so it also accounts for the start delay) [1]. This was done
to solve problems like the one you are describing. Unfortunately we
ca
Hi all,
I'm using Flink 1.13 and my job is experiencing high start delay, more so
than high alignment time (our FLIP-27 Kafka source is heavily
backpressured). Since our alignment timeout is set to 1s, the unaligned
checkpoint never triggers, because the alignment delay is always below the
threshold.
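To make the difference concrete, here is a minimal sketch of the two timeout rules discussed in this thread. It is a simplified model, not Flink's actual code: the class and method names are hypothetical, the start delay and alignment values are made up, and treating the barrier age as "start delay plus alignment time" is an assumption for illustration only.

```java
import java.time.Duration;

/** Simplified model of the two timeout rules from this thread; NOT Flink's implementation. */
final class BarrierTimeoutModel {

    private BarrierTimeoutModel() {}

    /** 1.13-style rule: only the time spent aligning is compared with the timeout. */
    static boolean switchesToUnaligned113(Duration alignment, Duration timeout) {
        // A timeout of 0 means the checkpoint starts unaligned right away.
        return timeout.isZero() || alignment.compareTo(timeout) > 0;
    }

    /** 1.14-style rule: the age of the barrier is compared, so the start delay counts too. */
    static boolean switchesToUnaligned114(Duration startDelay, Duration alignment, Duration timeout) {
        // Assumption: barrier age is approximated as start delay + alignment time.
        Duration barrierAge = startDelay.plus(alignment);
        return timeout.isZero() || barrierAge.compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(1);      // alignment timeout from the job above
        Duration startDelay = Duration.ofMinutes(20);  // hypothetical high start delay
        Duration alignment = Duration.ofMillis(200);   // hypothetical short alignment time

        System.out.println("1.13 rule switches: " + switchesToUnaligned113(alignment, timeout));
        System.out.println("1.14 rule switches: " + switchesToUnaligned114(startDelay, alignment, timeout));
    }
}
```

With these made-up numbers the 1.13-style rule never switches to unaligned (200ms of alignment is below the 1s timeout), while the 1.14-style rule does, because the 20 minute start delay is part of the barrier age. Under the 1.13-style rule, a timeout of 0 switches regardless of the alignment time.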
I