Hi Tony,

Thanks for your response! I will definitely check out supervisord.
I wonder if there is a way to recover the killed JM and add it back to the
cluster by using one of the scripts in *flink/bin/*.

Thanks!

Best regards,
Mu

On Thu, Feb 1, 2018 at 3:50 PM, Tony Wei <tony19920...@gmail.com> wrote:

> Hi Mu,
>
> AFAIK, that is the expected behavior when you launch your cluster in
> standalone mode. Flink HA guarantees that the standby JM will take over the
> whole cluster. The illustration only says that a recovered JM will become
> another standby machine, but recovering a single instance is not Flink HA's
> responsibility.
> One possible way might be to use supervisord [1] to launch your JM
> instance; it can monitor your process and automatically restart it when
> the process fails unexpectedly. Or you can use a YARN cluster, which will
> be responsible for recovering the dead JM.
>
> Best,
> Tony Wei
>
> [1] http://supervisord.org/
>
> 2018-02-01 14:11 GMT+08:00 Mu Kong <kong.mu....@gmail.com>:
>
>> Hi all,
>>
>> I have a Flink HA cluster with 2 job managers and a zookeeper quorum of 3
>> nodes.
>>
>> My failed job manager didn't get recovered after I killed it.
>> Here is how I did it and what I've observed:
>>
>> 1. I started the HA cluster with start-cluster.sh
>> 2. Job manager A got elected.
>> 3. I killed job manager A with the kill command.
>> 4. Job manager B got elected.
>> 5. Job manager B was working well.
>> 6. But job manager A never recovered since then.
>>
>> Am I missing something here, or is it the case that HA cannot handle such
>> a failover (the Flink instance gets killed directly)?
>>
>> Thanks!
>>
>> Best regards,
>> Mu