[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805157#comment-17805157 ]
Matthias Pohl commented on FLINK-34007:
---------------------------------------

Ok, I went through the log file you shared. AFAIS, suspending the JobManager worked as expected:
* The Job with the ID {{217cee964b2cfdc3115fb74cac0ec550}} was suspended due to the leadership loss for session ID {{9987190b-35f4-4238-b317-057dc3615e4d}}.
* The ResourceManager and the Dispatcher got their leadership revoked as well.
* The ResourceManager is not shut down.
* The Dispatcher is stopped but the corresponding DispatcherLeaderProcess keeps running. That's the process that should trigger another Dispatcher initialization if it picks up leadership again.

The {{RecipientUnreachableException}} appears because no leader is being re-elected, I guess. Does this match your findings? As far as I understand, you don't have any other standby JM running in the Flink cluster? We would expect this very same JobManager to pick up leadership again. Do we have some logs from the Kubernetes cluster that we could investigate?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed
> container not able to register itself to resource manager after timeout.
> JM Log, see attached

--
This message was sent by Atlassian Jira
(v8.20.10#820010)