Thanks for the pointers, Alexander and David!

We are exploring reactive mode on Flink 1.13, and my questions are merely
hypothetical. Our streaming job consumes from Kafka, performs enrichment
by querying external services, and sinks to S3. Under backpressure, or at
seemingly random times, we observed one or two subtasks that did not
acknowledge the checkpoint within 10 minutes (the timeout interval), with no
metric information (such as sync/async duration or alignment duration) shown
in the UI. Unaligned checkpoints didn't address this either.

In such instances, if the HPA scales up the replicas and the job restarts,
it will rewind the offsets to those of a previous checkpoint. One idea was to
see if we can take job health into consideration as an additional metric for
the HPA to prevent this.
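
A rough sketch of what such a health gate could look like, not a tested
integration. It assumes the checkpoint statistics are polled from the
JobManager REST endpoint /jobs/<jobid>/checkpoints and that the response
exposes `latest.completed` and `latest.failed` entries carrying a
`trigger_timestamp` in milliseconds (worth verifying against the Flink version
in use). The function name `scale_up_allowed` and the staleness threshold are
hypothetical, and how the result is fed to the HPA is left out:

```python
import time
from typing import Any, Mapping, Optional


def scale_up_allowed(stats: Mapping[str, Any],
                     max_checkpoint_age_ms: int,
                     now_ms: Optional[int] = None) -> bool:
    """Decide whether a rescale (and hence a restart) looks safe.

    `stats` is the parsed JSON body of the checkpoint statistics; the field
    names used here are assumptions about that response.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    latest = stats.get("latest") or {}
    completed = latest.get("completed")
    failed = latest.get("failed")

    # No completed checkpoint yet: a restart would replay everything.
    if not completed:
        return False

    completed_ts = completed["trigger_timestamp"]

    # The most recent attempt failed or timed out after the last success.
    if failed and failed["trigger_timestamp"] > completed_ts:
        return False

    # The last good checkpoint is too old, so the rewind would be large.
    return now_ms - completed_ts <= max_checkpoint_age_ms
```

Passing `now_ms` explicitly keeps the check deterministic; a caller would
normally omit it and let the function use the current wall-clock time.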

"Buffer de-bloating" looks promising.

- Aryan

On Tue, Apr 12, 2022 at 2:22 AM David Morávek <d...@apache.org> wrote:

> Hi Aryan,
>
> this is an interesting thought. What kind of option do you have in mind?
> My take on this is that if a checkpoint times out, it's pretty likely that
> the next one will time out as well, and the scheduler has no way of knowing
> that the next one would succeed. Also, up-scaling might help to mitigate the
> timeout in some cases. Can you also elaborate on the reason the checkpoint
> is timing out? Could this be addressed by unaligned checkpoints and buffer
> de-bloating by any chance?
>
> Best,
> D.
>
> On Tue, Apr 12, 2022 at 10:15 AM Alexander Preuß <
> alexanderpre...@ververica.com> wrote:
>
>> Hello,
>>
>> There are no scheduler-specific options for checkpointing. You can
>> however set `execution.checkpointing.tolerable-failed-checkpoints` to 0 to
>> forbid checkpoint failures (
>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-tolerable-failed-checkpoints
>> ).
>>
>> Best regards,
>> Alexander
>>
>>
>>
>> On Tue, Apr 12, 2022 at 6:43 AM aryan m <maryan8...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> Are there options in reactive mode to prevent a job from restarting if
>>> the last checkpoint failed or timed out for any reason?
>>>
>>>
>>> Thanks,
>>> AR
>>>
>>>
>>
>> --
>>
>> Alexander Preuß | Engineer - Data Intensive Systems
>>
>> alexanderpre...@ververica.com
>>
>> <https://www.ververica.com/>
>>
