Hi Community,
I have a number of Flink jobs running in a session cluster, with varying checkpoint intervals and a large amount of operator state. During periods of high load, the jobs fail due to checkpoint timeouts (the timeout is set to 6 minutes). I can only assume this is because the latency of saving checkpoints increases under high load.

I have a 30-node HDFS cluster for checkpoints; however, I see that only 4 of these nodes are being used for storage. Is there a way to ensure the load is spread evenly? Could there be another reason for these checkpoint timeouts?

Events are consumed from Kafka and written to Kafka with exactly-once guarantees enabled.

Thank you very much!
M.