[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805157#comment-17805157
 ] 

Matthias Pohl commented on FLINK-34007:
---------------------------------------

OK, I went through the log file you shared. As far as I can see, suspending the 
JobManager worked as expected:
* The Job with the ID {{217cee964b2cfdc3115fb74cac0ec550}} was suspended due to 
the leadership loss for session ID {{9987190b-35f4-4238-b317-057dc3615e4d}}.
* The ResourceManager and the Dispatcher got their leadership revoked as well.
* The ResourceManager is not shut down. 
* The Dispatcher is stopped but the corresponding DispatcherLeaderProcess keeps 
running. That's the process that should trigger another Dispatcher 
initialization if it picks up leadership again.
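
To illustrate the lifecycle described in the list above, here is a minimal Java 
sketch. It is a toy model and not Flink's actual code: the class and method 
names ({{LeaderProcessSketch}}, {{LeaderProcess}}, {{SessionComponent}}, 
{{grantLeadership}}, {{revokeLeadership}}) are hypothetical. It only shows the 
idea that the per-session component is stopped on leadership loss while the 
long-lived process stays alive and creates a fresh component once leadership is 
granted again.

```java
import java.util.UUID;

public class LeaderProcessSketch {

    /** Toy stand-in for the per-session component (the Dispatcher in the comment). */
    static final class SessionComponent {
        final UUID sessionId;
        private boolean running = true;

        SessionComponent(UUID sessionId) {
            this.sessionId = sessionId;
        }

        void stop() {
            running = false;
        }

        boolean isRunning() {
            return running;
        }
    }

    /** Toy stand-in for the long-lived process that survives leadership loss. */
    static final class LeaderProcess {
        // null while this process does not hold leadership
        private SessionComponent current;

        /** Called when leadership is granted: start a fresh component for the new session. */
        void grantLeadership(UUID sessionId) {
            revokeLeadership(); // make sure no component from an old session is left over
            current = new SessionComponent(sessionId);
        }

        /** Called when leadership is lost: stop the component, but keep the process itself. */
        void revokeLeadership() {
            if (current != null) {
                current.stop();
                current = null;
            }
        }

        boolean hasActiveComponent() {
            return current != null && current.isRunning();
        }
    }

    public static void main(String[] args) {
        LeaderProcess process = new LeaderProcess();

        process.grantLeadership(UUID.randomUUID());
        System.out.println("after first grant: active=" + process.hasActiveComponent());

        // Leadership is lost: the component is gone, the process is still there.
        process.revokeLeadership();
        System.out.println("after revoke:      active=" + process.hasActiveComponent());

        // If no new grant ever arrives, the process stays in the state above.
        process.grantLeadership(UUID.randomUUID());
        System.out.println("after re-grant:    active=" + process.hasActiveComponent());
    }
}
```

In this model, a job that stays suspended corresponds to the state after 
{{revokeLeadership()}} when no later {{grantLeadership(...)}} call arrives, 
which is why the leader election side is the interesting part to look at.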

I guess the {{RecipientUnreachableException}} appears because no leader is 
re-elected. Does this match your findings?

As far as I understand, you don't have any other standby JM running in the 
Flink cluster? In that case, we would expect this very same JobManager to pick 
up leadership again. Do we have some logs from the Kubernetes cluster that we 
could investigate?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> -------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>         Attachments: job-manager.log
>
>
> The observation is that the JobManager goes into the suspended state, with a 
> failed container that is not able to register itself with the ResourceManager 
> after a timeout.
> For the JM log, see the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
