Re: Checkpoint failures without exceptions

Yun Gao Wed, 27 Oct 2021 07:24:32 -0700

Hi Patrick,

Could you also have a look at the stacks of the tasks of the
second function to see what the main thread and the netty
threads are doing during the checkpoint period?


Best,
Yun



 ------------------Original Mail ------------------
Sender: <patrick.eif...@sony.com>
Send Date:Wed Oct 27 22:05:40 2021
Recipients:Flink ML <user@flink.apache.org>
Subject:Checkpoint failures without exceptions

Hi Flink Community,
I have an issue with failing checkpoints on all stateful jobs in a session
cluster which I have been unable to track down so far. The jobs read from
Kafka and write back to Kafka.

Only the first checkpoint completes; all others fail.
The watermarks are progressing regularly and are aligned between sub tasks.
In the Flink Web UI the backpressure is showing as OK.
The target topic receives all records as expected. No exceptions occur
in the jobs.

The metrics for inPoolUsage and outPoolUsage show 0, but the numRecordsOut of
the Window Processor (written as a KeyedProcessFunction) shows the expected
increasing count. The checkpoint alignment time shows 0 ms.

I’m using Flink 1.11.1 and have enabled unaligned checkpoints. The state and
the checkpoints are currently just stored in the memory of the job manager
node; no state backend is configured. The job manager and the task manager
have enough memory.

The job graph has 2 process functions:
1. Filter and enrich events
2. KeyBy/WindowProcessor to KafkaSink

The checkpoints for the first process function always complete. For the
second, the checkpoints show up as 0% acknowledged, which fails the whole
checkpoint.

Finally, the checkpoint failures only happen in an environment with higher
load; the same jobs run fine in another environment with lower load.

The window duration is set to 24 hours and the checkpoints are set as follows:

checkpoint-interval = 5 minutes
min-pause-between-checkpoints = 1 minute
checkpoint-timeout = 10 minutes
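
For reference, a minimal sketch of the same three values in milliseconds,
which is the unit the CheckpointConfig setters in the DataStream API take.
The class and constant names are made up for illustration, and the commented
calls are only an assumption about how a job would apply the values; the
mail does not say how the settings are actually passed to the job.

```java
import java.time.Duration;

public class CheckpointSettings {
    // Values copied from the settings listed above.
    static final Duration CHECKPOINT_INTERVAL = Duration.ofMinutes(5);
    static final Duration MIN_PAUSE_BETWEEN_CHECKPOINTS = Duration.ofMinutes(1);
    static final Duration CHECKPOINT_TIMEOUT = Duration.ofMinutes(10);

    public static void main(String[] args) {
        // Assumption: with Flink on the classpath and a StreamExecutionEnvironment
        // named env, the values would typically be applied like this:
        //   env.enableCheckpointing(CHECKPOINT_INTERVAL.toMillis());
        //   env.getCheckpointConfig()
        //      .setMinPauseBetweenCheckpoints(MIN_PAUSE_BETWEEN_CHECKPOINTS.toMillis());
        //   env.getCheckpointConfig().setCheckpointTimeout(CHECKPOINT_TIMEOUT.toMillis());
        //   env.getCheckpointConfig().enableUnalignedCheckpoints();
        System.out.println("interval ms  = " + CHECKPOINT_INTERVAL.toMillis());
        System.out.println("min pause ms = " + MIN_PAUSE_BETWEEN_CHECKPOINTS.toMillis());
        System.out.println("timeout ms   = " + CHECKPOINT_TIMEOUT.toMillis());
    }
}
```

Note that the timeout is twice the interval, so a checkpoint that is never
acknowledged is only reported as failed 10 minutes after it was triggered.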

The Kafka source is configured with a forBoundedOutOfOrderness watermark
strategy and an idleness parameter.
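
For readers less familiar with forBoundedOutOfOrderness, here is a rough
stand-alone sketch of the rule that strategy applies: the watermark trails
the highest timestamp seen so far by the configured bound, minus one
millisecond. This is not the Flink class itself; the class name and the
5 second bound are hypothetical (the mail does not give the actual value),
and idleness handling is left out.

```java
public class BoundedOutOfOrdernessSketch {
    private final long outOfOrdernessMillis;
    private long maxTimestamp;

    public BoundedOutOfOrdernessSketch(long outOfOrdernessMillis) {
        this.outOfOrdernessMillis = outOfOrdernessMillis;
        // Start high enough above Long.MIN_VALUE that the subtraction
        // in currentWatermark() cannot underflow before any event arrives.
        this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
    }

    // Track the highest event timestamp seen so far.
    public void onEvent(long eventTimestampMillis) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMillis);
    }

    // The watermark lags the highest timestamp by the bound, minus 1 ms.
    public long currentWatermark() {
        return maxTimestamp - outOfOrdernessMillis - 1;
    }

    public static void main(String[] args) {
        // Hypothetical bound of 5 seconds.
        BoundedOutOfOrdernessSketch wm = new BoundedOutOfOrdernessSketch(5_000L);
        wm.onEvent(10_000L);
        wm.onEvent(8_000L);  // out-of-order event, does not move the watermark back
        wm.onEvent(12_000L);
        System.out.println(wm.currentWatermark()); // prints 6999
    }
}
```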

I’m wondering what I am missing here.

Thanks!
