Hi Till, John,

I agree with the issue John mentioned and have the same problem.
We can only start a standalone HA cluster with the ./start-cluster.sh script. Then, when there are failures, we can restart those components individually by calling jobmanager.sh/taskmanager.sh. This works great.

But, like John mentioned, if we want to start the cluster initially by running jobmanager.sh on each JobManager node, it does not work. It binds to localhost and does not form the HA cluster.

Thanks,
Shakir

From: Till Rohrmann <trohrm...@apache.org>
Date: Tuesday, June 18, 2019 at 4:23 PM
To: John Smith <java.dev....@gmail.com>
Cc: user <user@flink.apache.org>
Subject: [EXTERNAL] Re: How to restart/recover on reboot?

I guess it should work if you installed a systemd service which simply calls `jobmanager.sh start` or `taskmanager.sh start`.

Cheers,
Till

On Tue, Jun 18, 2019 at 4:29 PM John Smith <java.dev....@gmail.com> wrote:

Yes, that is understood. But I don't see why we cannot call jobmanager.sh and taskmanager.sh to build the cluster and have them run as systemd units.

I looked at start-cluster.sh and all it does is SSH and call jobmanager.sh, which then cascades to taskmanager.sh. I just have to pinpoint what's missing to get the systemd service working. In fact, calling jobmanager.sh as a systemd service actually sees the shared masters, slaves and flink-conf.yaml. But it binds to localhost.

Maybe one way to do it would be to bootstrap the cluster with ./start-cluster.sh and then install systemd services for jobmanager.sh and taskmanager.sh.

Like I said, I don't want to have some process in place to remind admins they need to manually start a node every time they patch or a host goes down for whatever reason.

On Tue, 18 Jun 2019 at 04:31, Till Rohrmann <trohrm...@apache.org> wrote:

When a single machine fails you should rather call `taskmanager.sh start`/`jobmanager.sh start` to start a single process. `start-cluster.sh` will start multiple processes on different machines.

Cheers,
Till

On Mon, Jun 17, 2019 at 4:30 PM John Smith <java.dev....@gmail.com> wrote:

Well, a few reasons: machine reboots, maintenance, etc. Hosts/VMs crash and restart. And the same goes for the job manager. I don't want to have to document or remember some start process for sysadmins/DevOps.

So far I have looked at ./start-cluster.sh and all it seems to do is SSH into all the specified nodes and start the processes using the jobmanager and taskmanager scripts. I don't see anything special in any of the sh scripts.

I configured passwordless SSH through Terraform and all of that works great. It is only when trying to do the manual start through systemd that I may have something missing...

On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <trohrm...@apache.org> wrote:

Hi John,

I do not have much experience with setting Flink up via systemd services. Why do you want to do it like that?

1. In standalone mode, Flink won't automatically restart TaskManagers. This only works on Yarn and Mesos atm.
2. In case of a lost TaskManager, you should run `taskmanager.sh start`. This script simply starts a new TaskManager process.
3. I guess you could use systemd to bring up a Flink TaskManager process on start up.

Cheers,
Till

On Fri, Jun 14, 2019 at 5:56 PM John Smith <java.dev....@gmail.com> wrote:

I looked into start-cluster.sh and I don't see anything special. So technically it should be as easy as installing systemd services to run jobmanager.sh and taskmanager.sh respectively?

On Wed, 12 Jun 2019 at 13:02, John Smith <java.dev....@gmail.com> wrote:

The installation instructions do not indicate how to create systemd services.

1- When task nodes fail, will the job leader detect this and SSH in and restart the task node? From my testing it doesn't seem like it.
2- How do we recover a lost node?
Do we simply go back to the master node and run start-cluster.sh, and is the script smart enough to figure out what is missing?
3- Or do we need to create systemd services, and if so, with which command do we start the service?
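
For question 3, here is a minimal sketch of what such a systemd service could look like, written as a small shell helper that prints a unit file wrapping `jobmanager.sh start` or `taskmanager.sh start` as suggested in the replies. The helper name `render_flink_unit`, the install path `/opt/flink`, the `flink` user and the chosen unit options are all assumptions for illustration and do not come from the thread. `Type=forking` assumes the start script puts the JVM in the background and returns. This sketch does not by itself address the localhost binding described above.

```shell
#!/usr/bin/env bash
# Sketch only: print a systemd unit that wraps one Flink start script.
# Assumed (not from the thread): /opt/flink, the "flink" user, unit options.

render_flink_unit() {
    local script="$1"                    # jobmanager.sh or taskmanager.sh
    local flink_home="${2:-/opt/flink}"  # hypothetical install path
    cat <<EOF
[Unit]
Description=Apache Flink (${script})
After=network-online.target
Wants=network-online.target

[Service]
# Assumes "${script} start" daemonizes the JVM and then exits.
Type=forking
User=flink
ExecStart=${flink_home}/bin/${script} start
ExecStop=${flink_home}/bin/${script} stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}
```

One possible way to use it, with a hypothetical unit name: run `render_flink_unit taskmanager.sh > /etc/systemd/system/flink-taskmanager.service`, then `systemctl daemon-reload` followed by `systemctl enable --now flink-taskmanager`, so the TaskManager comes back after a reboot without anyone having to start it manually.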