Thanks for the update, Piotr. The reason it prevents us from using checkpoints is this: we rely on checkpoints to trigger the commit of Kafka offsets for our source (Kafka consumers). When there is no backpressure, this works fine. When there is backpressure, checkpoints fail because they take too long, and our Kafka offsets are never committed to the Kafka brokers (as we just learned the hard way).
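To make the failure mode concrete, here is a toy model in Python. It is not Flink code: the function name, the 100 records per interval, and the checkpoint durations are all made up for illustration. It only assumes what is described above, namely that offsets reach the broker when a checkpoint completes within its timeout and not otherwise.

```python
def run(checkpoint_durations, timeout):
    """Toy model: offsets are committed only when a checkpoint completes."""
    consumed = 0   # offset the consumer has read up to
    committed = 0  # offset last committed to the broker
    for duration in checkpoint_durations:
        consumed += 100  # records read between checkpoint triggers (made-up figure)
        if duration <= timeout:
            # Checkpoint completed in time, so the offsets are committed.
            committed = consumed
        # Otherwise the checkpoint expired and nothing is committed.
    return consumed, committed


# No backpressure: every checkpoint finishes well inside the timeout.
print(run([2, 3, 2], timeout=10))

# Backpressure while catching up: later checkpoints exceed the timeout.
print(run([2, 30, 45], timeout=10))
```

In the second run the consumer keeps making progress, but the committed offset stays where the last successful checkpoint left it, which is the gap we ended up with on the brokers.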
Normally there is no backpressure in our jobs, but after an outage the jobs do experience backpressure while catching up. And when you're already trying to recover from an incident, that is not the ideal time for Kafka offset commits to stop working.