[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261061#comment-17261061 ]
Yun Tang commented on FLINK-20886:
----------------------------------

Currently, once a checkpoint expires, Flink aborts it by sending a message over the network and tries to [stop the async phase of checkpoints|https://github.com/apache/flink/blob/c6786ab9cf7e40be41a5a9c12461d5e60a789195/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java#L443]. I just wonder whether Flink would still have enough time to print the thread dump once the timeout hits.

> Add the option to get a threaddump on checkpoint timeouts
> ---------------------------------------------------------
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major
>
> For debugging checkpoint timeouts, I was thinking about the following
> addition to Flink:
> When a checkpoint times out and the async thread is still running, create a
> thread dump [1] and either add this to the checkpoint stats, log it, or write
> it out.
> This may help identify where the checkpoint is stuck (maybe a lock, could
> also be in a third-party lib like the FS connectors, ...). It would give us
> some insight into what the thread is currently doing.
> Limiting the scope of the threads would be nice but may not be possible in
> the general case since additional threads (spawned by the FS connector lib,
> or otherwise connected) may interact with the async thread(s), e.g. by going
> through the same locks. Maybe we can reduce the thread dumps to all async
> threads of the failed checkpoint + all threads that interact with them, e.g.
> via locks?
> I'm also not sure whether the ability to have thread dumps or not should be
> user-configurable (Could it contain sensitive information from other jobs if
> you run a session cluster? Is that even relevant since we don't give
> isolation guarantees anyway?). If it is configurable, it should be on by
> default.
> [1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
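
For reference, a minimal sketch of how such a thread dump could be captured programmatically with the JDK's `ThreadMXBean`. This is not taken from Flink: the class name `ThreadDumpSketch` and the method `dumpAllThreads` are hypothetical, and where the resulting string would go (checkpoint stats, log, file) is left open, as in the issue description. The sketch also prints the lock each thread is waiting on and that lock's owner, which is the kind of information that could help narrow a dump down to the threads interacting with the async checkpoint thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Flink.
final class ThreadDumpSketch {

    /** Returns a textual dump of all live threads with full stack traces. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request lock details the running JVM can actually provide.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState());
            if (info.getLockName() != null) {
                // Lock this thread is blocked or waiting on, plus its owner if known.
                sb.append(" on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    sb.append(" owned by \"").append(info.getLockOwnerName()).append('"');
                }
            }
            sb.append('\n');
            // Format frames manually: ThreadInfo.toString() truncates the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Taking the dump walks all live threads and can briefly pause them, so in practice it would probably only be triggered once per expired checkpoint rather than periodically.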