Hi Patrick,

do you actually have enough backpressure that unaligned checkpoints are necessary? Your job seems to have only one network exchange where unaligned checkpoints would help. The unaligned checkpoint implementation in Flink 1.11 was still experimental and may cause unexpected side effects. AFAIK, we fully recommend unaligned checkpoints in production settings only from Flink 1.13 onward. With aligned checkpoints, you may also want to reduce your network buffers to get more reliable checkpointing times under backpressure [1].
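
As a rough sketch of what that could look like in flink-conf.yaml (assuming the job does not override these settings in code, in which case the change would have to be made there; the two buffer values are illustrative starting points to experiment with, not tuned recommendations):

```yaml
# flink-conf.yaml (sketch only; verify option names against the docs for your Flink version)

# Fall back to aligned checkpoints.
execution.checkpointing.unaligned: false

# Checkpoint timing as described in the original mail.
execution.checkpointing.interval: 5min
execution.checkpointing.min-pause: 1min
execution.checkpointing.timeout: 10min

# Fewer in-flight buffers, so checkpoint barriers have less data queued ahead of them.
# Both values are assumptions chosen for illustration.
taskmanager.network.memory.buffers-per-channel: 1
taskmanager.network.memory.floating-buffers-per-gate: 4
```

Lowering the buffer counts trades some throughput for shorter barrier travel time, so it is worth changing one value at a time and comparing checkpoint durations in the Web UI.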

TL;DR: I would turn off unaligned checkpoints and see what happens. If you see that unaligned checkpoints are necessary, I'd upgrade Flink.

[1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#memory-configuration

On Wed, Oct 27, 2021 at 4:24 PM Yun Gao <yungao...@aliyun.com> wrote:

> Hi Patrick,
>
> Could you also have a look at the stack of the tasks of the
> second function to see what the main thread and netty
> thread is doing during the checkpoint period?
>
> Best,
> Yun
>
>
> ------------------Original Mail ------------------
> *Sender:* <patrick.eif...@sony.com>
> *Send Date:* Wed Oct 27 22:05:40 2021
> *Recipients:* Flink ML <user@flink.apache.org>
> *Subject:* Checkpoint failures without exceptions
>
>> Hi Flink Community,
>>
>> I have an issue with failing checkpoints on all stateful jobs in a
>> session cluster which I’m unable to track down so far. The jobs sit between
>> Kafka.
>>
>> Only the first checkpoint gets completed all others fail.
>>
>> The watermarks are progressing regularly and are aligned between sub
>> tasks.
>>
>> In the Flink Web UI the backpressure is showing as OK.
>>
>> The target topic gets all records outputted as expected. No exceptions
>> occurred in the jobs.
>>
>> The metrics for inPoolUsage and outPoolUsage show 0 but the numRecordsOut
>> of the Window Processor (written with the KeyedProcessFunction) shows the
>> expecting incrementing number. The Checkpoint alignment time shows 0ms.
>>
>> I’m using Flink 1.11.1 and enabled unaligned checkpoints. The state and
>> the checkpoints are currently just stored in memory of the job manager node
>> – no state backend is configured. The jobmanager and task manager have
>> enough memory.
>>
>> The job graph has 2 process functions:
>>
>> 1. Filter and enrich events
>> 2. Keyby/WindowProcessor to KafkaSink
>>
>> The checkpoints for the first process function always gets completed. For
>> the second the checkpoints show up as 0 % acknowledged which fail the whole
>> checkpoint.
>>
>> Finally the checkpoint failures only happens in an environment with
>> higher load – the same jobs run fine in another env with lower load.
>>
>> The window duration is set to 24 hours and the checkpoints are set as
>> follows:
>>
>> checkpoint-interval = 5 minutes
>>
>> min-pause-between-checkpoints = 1 minute
>>
>> checkpoint-timeout = 10 minutes
>>
>> The kafka source is configured with forBoundedOutOfOrderness and idleness
>> parameters.
>>
>> I’m wondering what am I missing here.
>>
>> Thanks!
>>