I had a similar problem, where two concurrent checkpoints were configured. I was also saving checkpoints to S3 (via MinIO) on a Kubernetes 1.18 environment. The Flink service kept restarting and checkpoints were timing out. It was resolved as follows:

1. MinIO had run out of disk space, which caused the checkpoint failures (this was the main cause).
2. I tuned the checkpoint pacing parameters execution.checkpointing.max-concurrent-checkpoints and execution.checkpointing.min-pause.

Details at: https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#checkpointing
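For reference, the same two settings can also be applied programmatically through the CheckpointConfig. The sketch below is only illustrative; the interval, pause, and timeout values are assumptions, not recommendations.

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 s (interval chosen only for illustration).
        env.enableCheckpointing(60_000L);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Corresponds to execution.checkpointing.max-concurrent-checkpoints:
        // allow only one checkpoint in flight at a time.
        checkpointConfig.setMaxConcurrentCheckpoints(1);

        // Corresponds to execution.checkpointing.min-pause:
        // wait at least 30 s after a checkpoint completes before starting the next one.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);

        // Checkpoints running longer than 10 minutes are declared expired
        // (this matches Flink's default timeout).
        checkpointConfig.setCheckpointTimeout(600_000L);
    }
}

A longer min-pause mainly gives a slow checkpoint room to finish before the next one is triggered.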
On Wed, Oct 20, 2021 at 7:50 AM Caizhi Weng <tsreape...@gmail.com> wrote:

> Hi!
>
> I see you're using sliding event-time windows. What are the exact values of
> windowLengthMinutes and windowSlideTimeMinutes? If windowLengthMinutes is
> large and windowSlideTimeMinutes is small, then each record may be assigned
> to a large number of windows as the pipeline proceeds, which gradually slows
> down checkpointing and finally causes a timeout.
>
> On Tue, Oct 19, 2021 at 7:29 PM, Alexis Sarda-Espinosa
> <alexis.sarda-espin...@microfocus.com> wrote:
>
>> Hello everyone,
>>
>> I am doing performance tests for one of our streaming applications and,
>> after increasing the throughput a bit (~500 events per minute), it has
>> started failing because checkpoints cannot be completed within 10 minutes.
>> The Flink cluster is not exactly under my control and is running on
>> Kubernetes with version 1.11.3 and the RocksDB backend.
>>
>> I can access the UI and logs and have confirmed:
>>
>> - The logs do indicate expired checkpoints.
>> - There is no backpressure in any operator.
>> - When checkpoints do complete (seemingly at random):
>>   - Size is 10-20 MB.
>>   - Sync and async durations are at most 1-2 seconds.
>>   - In one of the tasks, alignment takes 1-3 minutes, but start delays
>>     grow to up to 5 minutes.
>> - The aforementioned task (the one with the 5-minute start delay) has 8
>>   sub-tasks and I see no indication of data skew. When the checkpoint
>>   times out, none of the sub-tasks have acknowledged the checkpoint.
>>
>> The problematic task that is failing very often (and holding back
>> downstream tasks) consists of the following operations:
>>
>> timestampedEventStream = events
>>     .keyBy(keySelector)
>>     .assignTimestampsAndWatermarks(watermarkStrategy);
>>
>> windowedStream = DataStreamUtils.reinterpretAsKeyedStream(timestampedEventStream, keySelector)
>>     .window(SlidingEventTimeWindows.of(
>>         Time.minutes(windowLengthMinutes),
>>         Time.minutes(windowSlideTimeMinutes)))
>>     .allowedLateness(Time.minutes(allowedLatenessMinutes));
>>
>> windowedStream
>>     .process(new ProcessWindowFunction1(config))
>>     // add sink
>>
>> windowedStream
>>     .process(new ProcessWindowFunction2(config))
>>     // add sink
>>
>> Both window functions use managed state, but nothing out of the ordinary
>> (as mentioned above, the state size is actually very small). Do note that
>> the same windowedStream is used twice.
>>
>> I don't see any obvious runtime issues and I don't think the load is
>> particularly high, but maybe there's something wrong in my pipeline
>> definition? What else could cause these timeouts?
>>
>> Regards,
>> Alexis.

--
Regards,
Parag Surajmal Somani.
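As an illustration of the window fan-out Caizhi describes above: a sliding-window assigner copies every element into size / slide overlapping windows. A minimal sketch, assuming a hypothetical 60-minute window with a 1-minute slide (the thread does not state the real values):

import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlidingWindowFanOutSketch {
    public static void main(String[] args) {
        // Hypothetical values: a 60-minute window that slides every 1 minute.
        SlidingEventTimeWindows assigner =
                SlidingEventTimeWindows.of(Time.minutes(60), Time.minutes(1));

        // Every element is placed into size / slide overlapping windows,
        // so window bookkeeping and per-checkpoint work scale with this ratio.
        long windowsPerElement = assigner.getSize() / assigner.getSlide();
        System.out.println("Each element is assigned to " + windowsPerElement + " windows");
    }
}

With those assumed values every record would belong to 60 windows at once, which can multiply the window-related work a checkpoint has to wait on even when the state itself stays small.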