Thanks for your reply! What I have seen is that the job terminates when there's an intermittent loss of connectivity with ZooKeeper. This is in fact the most common reason our jobs are terminating at this point. Worse, the job is unable to restore from a checkpoint during some (not all) of these terminations. Under these scenarios, won't the job try to recover from a savepoint?
I've gone through the various tickets reporting stability issues due to ZooKeeper, which you've mentioned you intend to resolve soon. But until the ZooKeeper-based HA is stable, should we assume that the job will repeatedly restore from savepoints? I would prefer to rely on Kafka offsets to resume where the job left off, instead of savepoints.