Hi Patrick,

Could you also take a look at the stack traces of the tasks of the second function, to see what the main thread and the Netty thread are doing during the checkpoint period?
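
In case it helps, here is a minimal sketch of pulling out just the relevant stacks. It only sees threads of the JVM it runs in, so it would have to run inside the TaskManager process; from outside, running the JDK's jstack tool against the TaskManager PID gives the same information. The class name and the thread-name fragments ("WindowProcessor", "Flink Netty") are assumptions for illustration and need to be adjusted to the names that actually show up in your dump.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TaskThreadDump {

    /**
     * Returns the stack traces of all live threads in this JVM whose name
     * contains at least one of the given fragments. Threads that share the
     * same name overwrite each other in the result map.
     */
    static Map<String, StackTraceElement[]> stacksMatching(String... nameFragments) {
        Map<String, StackTraceElement[]> result = new LinkedHashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            String name = e.getKey().getName();
            for (String fragment : nameFragments) {
                if (name.contains(fragment)) {
                    result.put(name, e.getValue());
                    break;
                }
            }
        }
        return result;
    }

    /** Formats one thread's stack in a jstack-like layout. */
    static String format(String threadName, StackTraceElement[] stack) {
        StringBuilder sb = new StringBuilder("\"" + threadName + "\"\n");
        for (StackTraceElement frame : stack) {
            sb.append("    at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical name fragments: replace with the real task and Netty thread names.
        stacksMatching("WindowProcessor", "Flink Netty")
                .forEach((name, stack) -> System.out.print(format(name, stack)));
    }
}
```

Taking a few dumps some seconds apart while a checkpoint is pending should show whether the task thread stays in the same frames the whole time.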

Best,
Yun

------------------ Original Mail ------------------
Sender: <patrick.eif...@sony.com>
Send Date: Wed Oct 27 22:05:40 2021
Recipients: Flink ML <user@flink.apache.org>
Subject: Checkpoint failures without exceptions

Hi Flink Community,

I have an issue with failing checkpoints on all stateful jobs in a session cluster, which I’m unable to track down so far. The jobs read from and write to Kafka. Only the first checkpoint completes; all others fail.

The watermarks are progressing regularly and are aligned between subtasks. In the Flink Web UI the backpressure is showing as OK. The target topic receives all records as expected. No exceptions occurred in the jobs. The metrics for inPoolUsage and outPoolUsage show 0, but the numRecordsOut of the Window Processor (written as a KeyedProcessFunction) shows the expected increasing count. The checkpoint alignment time shows 0 ms.

I’m using Flink 1.11.1 with unaligned checkpoints enabled. The state and the checkpoints are currently just stored in the memory of the JobManager node; no state backend is configured. The JobManager and TaskManager have enough memory.

The job graph has 2 process functions:

1. Filter and enrich events
2. KeyBy/WindowProcessor to KafkaSink

The checkpoints for the first process function always complete. For the second, the checkpoints show up as 0% acknowledged, which fails the whole checkpoint.

Finally, the checkpoint failures only happen in an environment with higher load; the same jobs run fine in another environment with lower load.

The window duration is set to 24 hours and the checkpoints are set as follows:

checkpoint-interval = 5 minutes
min-pause-between-checkpoints = 1 minute
checkpoint-timeout = 10 minutes

The Kafka source is configured with forBoundedOutOfOrderness and idleness parameters.

I’m wondering what I’m missing here.

Thanks!