Ah, I think I can just use ./bin/jobmanager.sh, as described here:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html#adding-a-jobmanager
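For the record, a minimal sketch of what that could look like on the machine where the killed JobManager ran (the Flink home path is a placeholder; check the linked cluster setup page for the exact arguments in your Flink version):

```shell
# On the standby machine (installation path is an assumption):
cd /opt/flink

# Restart the JobManager process; it should rejoin the HA cluster as a
# standby and re-register with the ZooKeeper quorum.
./bin/jobmanager.sh start

# To shut it down cleanly later:
# ./bin/jobmanager.sh stop
```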
Thanks!

On Thu, Feb 1, 2018 at 4:00 PM, Mu Kong <kong.mu....@gmail.com> wrote:

> Hi Tony,
>
> Thanks for your response!
> I will definitely check supervisord.
>
> I wonder if there is a way to recover the killed JM and add it back to
> the cluster by using one of the scripts in flink/bin/.
>
> Thanks!
>
> Best regards,
> Mu
>
> On Thu, Feb 1, 2018 at 3:50 PM, Tony Wei <tony19920...@gmail.com> wrote:
>
>> Hi Mu,
>>
>> AFAIK, that is the expected behavior when you launch your cluster in
>> standalone mode. Flink HA guarantees that a standby JM will take over the
>> whole cluster. The illustration only says that a recovered JM becomes
>> another standby machine; recovering a single instance is not Flink HA's
>> responsibility.
>> One possible approach is to launch your JM instance under supervisord [1],
>> which monitors the process and automatically restarts it when it fails
>> unexpectedly. Alternatively, you can use a YARN cluster, where YARN is
>> responsible for recovering the dead JM.
>>
>> Best,
>> Tony Wei
>>
>> [1] http://supervisord.org/
>>
>> 2018-02-01 14:11 GMT+08:00 Mu Kong <kong.mu....@gmail.com>:
>>
>>> Hi all,
>>>
>>> I have a Flink HA cluster with 2 job managers and a ZooKeeper quorum
>>> of 3 nodes.
>>>
>>> My failed job manager didn't get recovered after I killed it.
>>> Here is how I killed it and what I observed:
>>>
>>> 1. I started the HA cluster with start-cluster.sh.
>>> 2. Job manager A got elected.
>>> 3. I killed job manager A with the kill command.
>>> 4. Job manager B got elected.
>>> 5. Job manager B was working well.
>>> 6. But job manager A never recovered after that.
>>>
>>> Am I missing something here, or is it the case that HA cannot handle
>>> such a failover (when the Flink instance gets killed directly)?
>>>
>>> Thanks!
>>>
>>> Best regards,
>>> Mu
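Following up on Tony's supervisord suggestion, a minimal sketch of a supervisord program entry that restarts a standalone JM when it dies (the file path, program name, user, and log locations are assumptions; start-foreground keeps the JM attached so supervisord can monitor it directly):

```ini
; /etc/supervisor/conf.d/flink-jobmanager.conf  (path is an assumption)
[program:flink-jobmanager]
; Run in the foreground so supervisord tracks the actual JM process,
; not a daemonizing wrapper script that exits immediately.
command=/opt/flink/bin/jobmanager.sh start-foreground
autostart=true
autorestart=true          ; restart the JM if it is killed or crashes
startretries=3
user=flink                ; assumed service account
stdout_logfile=/var/log/flink/jobmanager.out
stderr_logfile=/var/log/flink/jobmanager.err
```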