We had an issue recently where no checkpoints were able to complete, with the following messages appearing in the job manager logs:
2019-09-25 12:27:57,159 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline checkpoint 7041 by task 1f789ac3c5df655fe5482932b2255fd3 of job 214ccf9ab5edfb00f3bec3f454b57402.
2019-09-25 12:27:57,172 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 7041 of job 214ccf9ab5edfb00f3bec3f454b57402 because: Could not materialize checkpoint 7041 for operator uk.co.bbc.sawmill.streaming.pipeline.transformations.concurrentstreams.ConcurrentStreamsAggregator PERFORM COUNT DISTINCT OVER UUIDS FOR KEY -> ParDo(ToConcurrentStreamsResult)/ParMultiDo(ToConcurrentStreamsResult) -> JdbcIO.Write/ParDo(Write)/ParMultiDo(Write) (8/32).

This meant no checkpoints could ever complete until we restarted the job (we have the flag set that stops the job from failing on checkpoint failures, so the job itself kept running; roughly the setup sketched at the end of this message).

It's difficult to debug why this happened, though, because inspecting the task manager logs for the affected task shows no exceptions reported around the affected times, and there is no stack trace in the job manager logs when the checkpoint gets declined/discarded. I don't know whether a stack trace would give more context or not, but I can see the log line being printed here: https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1255 (the equivalent in release-1.9.0 is https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1347), and it doesn't print the stack trace of the underlying problem.

Is there something else we can look at to try to determine what happened? Note that this is not a recurring issue.
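
For context, here is a minimal sketch of the kind of checkpoint configuration I mean. The interval and mode are illustrative rather than our exact values, and setFailOnCheckpointingErrors(false) is the pre-1.9-style CheckpointConfig way of expressing "don't fail the job when a checkpoint fails" (our job is actually a Beam pipeline, so in practice this is set through the runner options rather than directly like this):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Enable periodic checkpoints (interval is illustrative).
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig checkpointConfig = env.getCheckpointConfig();

            // With this set to false, a declined/failed checkpoint is logged and
            // discarded by the CheckpointCoordinator, but the job keeps running.
            checkpointConfig.setFailOnCheckpointingErrors(false);
        }
    }

With that setting, the only visible symptom of the problem is the Decline/Discarding log lines above, which is why we're looking for other places to find the root cause.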