Thanks for the explanation. I hope that either Flink 1.5 will solve your issue (please let us know if it doesn’t!) or, if you can’t wait, that decreasing the memory buffers will mitigate the problem.
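
A rough sketch of the second option, assuming "memory buffers" here refers to the network buffer pool configured in flink-conf.yaml. The values below are illustrative only, not recommendations. Fewer buffers means less in-flight data queued ahead of a checkpoint barrier under backpressure, at some cost in throughput:

```yaml
# flink-conf.yaml (illustrative values, assumed for this sketch)
# Fraction of JVM memory used for network buffers (default 0.1)
taskmanager.network.memory.fraction: 0.05
# Lower and upper bounds for network buffer memory, in bytes
taskmanager.network.memory.min: 33554432
taskmanager.network.memory.max: 268435456
```

Whether this helps depends on the job, so it is worth comparing checkpoint durations before and after the change.
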
Piotrek

> On 5 Apr 2018, at 08:13, Edward <egb...@hotmail.com> wrote:
>
> Thanks for the update Piotr.
>
> The reason it prevents us from using checkpoints is this:
> We are relying on the checkpoints to trigger commit of Kafka offsets for our
> source (kafka consumers).
> When there is no backpressure this works fine. When there is backpressure,
> checkpoints fail because they take too long, and our Kafka offsets are never
> committed to Kafka brokers (as we just learned the hard way).
>
> Normally there is no backpressure in our jobs, but when there is some
> outage, then the jobs do experience
> backpressure when catching up. And when you're already trying to recover
> from an incident, that is not the ideal time for kafka offsets commits to
> stop working.
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/