We had an issue recently where no checkpoints were able to complete, with the following messages appearing in the job manager logs:
2019-09-25 12:27:57,159 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline checkpoint 7041 by task 1f789ac3c5df655fe5482932b2255fd3 of job 214ccf9ab5edfb00f3bec3f454b57402.
2019-09-25 12:27:57,172 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 7041 of job 214ccf9ab5edfb00f3bec3f454b57402 because: Could not materialize checkpoint 7041 for operator uk.co.bbc.sawmill.streaming.pipeline.transformations.concurrentstreams.ConcurrentStreamsAggregator PERFORM COUNT DISTINCT OVER UUIDS FOR KEY -> ParDo(ToConcurrentStreamsResult)/ParMultiDo(ToConcurrentStreamsResult) -> JdbcIO.Write/ParDo(Write)/ParMultiDo(Write) (8/32).

This meant no checkpoints could ever complete until we restarted the job (we have the flag set that stops the job from failing on checkpoint failures, so the job itself kept running; roughly the setup sketched at the end of this message).

It's difficult to debug why this happened, though, because inspecting the task manager logs for the affected task shows no exceptions reported around the affected times, and there is no stack trace in the job manager logs when the checkpoint gets declined/discarded. I don't know whether a stack trace would give more context or not, but I can see the log line being printed here: https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1255 (the equivalent in release-1.9.0 is https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1347), and it doesn't print the stack trace of the underlying problem.

Is there something else we can look at to try to determine what happened? Note that this is not a recurring issue.
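
For context, here is a minimal sketch of the kind of checkpoint configuration I mean. The interval and mode are illustrative rather than our exact values, and setFailOnCheckpointingErrors(false) is the pre-1.9-style CheckpointConfig way of expressing "don't fail the job when a checkpoint fails" (our job is actually a Beam pipeline, so in practice this is set through the runner options rather than directly like this):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Enable periodic checkpoints (interval is illustrative).
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig checkpointConfig = env.getCheckpointConfig();

            // With this set to false, a declined/failed checkpoint is logged and
            // discarded by the CheckpointCoordinator, but the job keeps running.
            checkpointConfig.setFailOnCheckpointingErrors(false);
        }
    }

With that setting, the only visible symptom of the problem is the Decline/Discarding log lines above, which is why we're looking for other places to find the root cause.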