Checkpoint timeouts at times of high load

Geldenhuys, Morgan Karl Thu, 01 Apr 2021 01:33:46 -0700

Hi Community,


I have a number of Flink jobs running in my session cluster, with varying
checkpoint intervals and a large amount of operator state. In times of high
load, the jobs fail due to checkpoint timeouts (the timeout is set to 6
minutes). I can only assume this is because the latency of saving checkpoints
increases under high load. I have a 30-node HDFS cluster for checkpoints;
however, I see that only 4 of these nodes are being used for storage. Is there
a way to ensure the load is spread evenly? Could there be another reason for
these checkpoint timeouts? Events are consumed from Kafka and written back to
Kafka with exactly-once guarantees enabled.


Thank you very much!


M.
