[ https://issues.apache.org/jira/browse/FLINK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608037#comment-17608037 ]
Yun Tang commented on FLINK-29329:
----------------------------------

I think the problem of checkpoints no longer being triggered is related to the [schedule timer|https://github.com/apache/flink/blob/b5cd9f34ab73fa69a3db5a09908c1aa954ed0597/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L168]. If you can reproduce this problem, you could use jmap to dump the job manager and see what happened to CheckpointCoordinator#timer.

> Checkpoint can not be triggered if encountering OOM
> ---------------------------------------------------
>
>                 Key: FLINK-29329
>                 URL: https://issues.apache.org/jira/browse/FLINK-29329
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yuxin Tan
>            Priority: Major
>             Fix For: 1.13.7
>
>         Attachments: job-exceptions-1.txt
>
>
> When writing a checkpoint, an OOM error is thrown. However, the JM does not
> fail; it appears to be restored, since I found the log message "No master
> state to restore". After that, the JM never makes checkpoints again. The root
> cause is not yet clear; this may be a bug, and we should handle OOM or other
> exceptions when making checkpoints.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
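
Editorial illustration of the schedule-timer hypothesis in the comment above. This is a minimal sketch using only the plain JDK, not Flink's code: it assumes the coordinator's timer behaves like a standard ScheduledExecutorService, which this thread does not confirm. The class name TimerSuppressionDemo and the method countRuns are hypothetical. The JDK documents that if any execution of a periodic task throws, subsequent executions are suppressed, so a single simulated OutOfMemoryError inside the task is enough to stop the periodic trigger silently while the process keeps running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TimerSuppressionDemo {

    /**
     * Schedules a periodic task that throws on its failAtRun-th execution,
     * waits observeMillis, and returns how many times the task ran.
     * Hypothetical helper for illustration only; not part of Flink.
     */
    public static int countRuns(int failAtRun, long periodMillis, long observeMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            timer.scheduleAtFixedRate(() -> {
                if (runs.incrementAndGet() == failAtRun) {
                    // Simulated error; a real OOM would surface the same way to the executor.
                    throw new OutOfMemoryError("simulated OOM in periodic task");
                }
            }, 0, periodMillis, TimeUnit.MILLISECONDS);

            // Give the timer far more time than failAtRun periods need.
            Thread.sleep(observeMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // The task fails on its 3rd run and is never rescheduled afterwards.
        System.out.println("runs = " + countRuns(3, 10, 500));
    }
}
```

With a 10 ms period and a 500 ms observation window, the task would run about 50 times if it never failed; here the count stops at the failing run. The error is captured in the task's future rather than logged, which is why a heap dump of the timer is a reasonable way to look for this state.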