There does indeed appear to be a code path in the StreamTask where an
exception might not be logged on the TaskExecutor
(StreamTask#handleExecutionException).
In FLINK-10753 the CheckpointCoordinator was adjusted to log the full
stacktrace; that change is part of 1.5.6.
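To illustrate why the stacktrace matters here, below is a minimal,
self-contained sketch of the difference between logging only the
exception message and logging the full trace. This is not the actual
CheckpointCoordinator code; the class name, method names and the
"hypothetical root cause" exception are made up for illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Illustrative only: NOT Flink's CheckpointCoordinator. All names are hypothetical.
public class DeclineLogSketch {

    // Message-only rendering: the stack trace and the cause chain are lost.
    static String messageOnly(long checkpointId, Throwable reason) {
        return "Discarding checkpoint " + checkpointId
                + " because: " + reason.getMessage();
    }

    // Full rendering: includes the stack trace and every "Caused by:" entry.
    static String withStackTrace(long checkpointId, Throwable reason) {
        StringWriter buffer = new StringWriter();
        reason.printStackTrace(new PrintWriter(buffer, true));
        return "Discarding checkpoint " + checkpointId
                + System.lineSeparator() + buffer;
    }

    public static void main(String[] args) {
        // Hypothetical failure: the interesting detail sits in the cause.
        Exception cause = new IOException("hypothetical root cause");
        Exception reason =
                new Exception("Could not materialize checkpoint 7041", cause);

        System.out.println(messageOnly(7041, reason));
        System.out.println(withStackTrace(7041, reason));
    }
}
```

With the message-only variant the root cause never reaches the log,
which matches the situation described in the quoted mail below.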
On 07/10/2019 09:51, Daniel Harper wrote:
We had an issue recently where no checkpoints were able to complete,
with the following message in the job manager logs
2019-09-25 12:27:57,159 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline
checkpoint 7041 by task 1f789ac3c5df655fe5482932b2255fd3 of job
214ccf9ab5edfb00f3bec3f454b57402.
2019-09-25 12:27:57,172 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator -
Discarding checkpoint 7041 of job 214ccf9ab5edfb00f3bec3f454b57402
because: Could not materialize checkpoint 7041 for operator
uk.co.bbc.sawmill.streaming.pipeline.transformations.concurrentstreams.ConcurrentStreamsAggregator
PERFORM COUNT DISTINCT OVER UUIDS FOR KEY ->
ParDo(ToConcurrentStreamsResult)/ParMultiDo(ToConcurrentStreamsResult)
-> JdbcIO.Write/ParDo(Write)/ParMultiDo(Write) (8/32).
This meant no checkpoints could complete until we restarted the
job (we have the “don’t fail on checkpoint failure” flag set).
It’s difficult to debug why this happened though because, from
inspecting the task manager logs for the affected task, there are no
exceptions being reported during the affected times, and there is no
stack trace in the job manager logs when the checkpoint gets
declined/discarded.
I don’t know whether a stack trace would give more context or not, but
I can see the log line being printed here
https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1255
<https://github.com/apache/flink/blob/release-1.9.0/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1347> –
which doesn’t print the stack trace of the problem.
Is there something else we can look at to try and determine what
happened?
Note that this is not a recurring issue.