When a single machine fails, you should instead call `taskmanager.sh start` or `jobmanager.sh start` on that machine to start a single process. `start-cluster.sh` starts multiple processes on different machines.
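As a rough sketch of the above: a small wrapper that restarts just one process on the local machine. The `FLINK_HOME` default of `/opt/flink`, the `DRY_RUN` switch and the function name `restart_flink_process` are made up for illustration; only `taskmanager.sh start` and `jobmanager.sh start` come from the Flink distribution.

```shell
#!/usr/bin/env bash
# Restart one Flink process on this machine instead of the whole cluster.
# FLINK_HOME default and DRY_RUN are assumptions of this sketch, not Flink settings.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"

restart_flink_process() {
  local role="$1" script
  case "$role" in
    taskmanager) script="$FLINK_HOME/bin/taskmanager.sh" ;;
    jobmanager)  script="$FLINK_HOME/bin/jobmanager.sh" ;;
    *)
      echo "usage: restart_flink_process taskmanager|jobmanager" >&2
      return 2
      ;;
  esac

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be executed.
    echo "$script start"
  else
    "$script" start
  fi
}
```

Run `restart_flink_process taskmanager` on the worker that went down, or `restart_flink_process jobmanager` on the master. If this is wired into systemd, the scripts' `start-foreground` mode (if your Flink version has it) is probably a better fit than `start`, since systemd can then supervise the process directly.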
Cheers,
Till

On Mon, Jun 17, 2019 at 4:30 PM John Smith <java.dev....@gmail.com> wrote:

> Well some reasons, machine reboots/maintenance etc... Host/VM crashes and
> restarts. And same goes for the job manager. I don't want/need to have to
> document/remember some start process for sys admins/devops.
>
> So far I have looked at ./start-cluster.sh and all it seems to do is SSH
> into all the specified nodes and starts the processes using the jobmanager
> and taskmanager scripts. I don't see anything special in any of the sh
> scripts.
> I configured passwordless ssh through terraform and all that works great
> only when trying to do the manual start through systemd. I may have
> something missing...
>
> On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi John,
>>
>> I have not much experience wrt setting Flink up via systemd services. Why
>> do you want to do it like that?
>>
>> 1. In standalone mode, Flink won't automatically restart TaskManagers.
>> This only works on Yarn and Mesos atm.
>> 2. In case of a lost TaskManager, you should run `taskmanager.sh start`.
>> This script simply starts a new TaskManager process.
>> 3. I guess you could use systemd to bring up a Flink TaskManager process
>> on start up.
>>
>> Cheers,
>> Till
>>
>> On Fri, Jun 14, 2019 at 5:56 PM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> I looked into the start-cluster.sh and I don't see anything special. So
>>> technically it should be as easy as installing Systemd services to run
>>> jobmanager.sh and taskmanager.sh respectively?
>>>
>>> On Wed, 12 Jun 2019 at 13:02, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> The installation instructions do not indicate how to create systemd
>>>> services.
>>>>
>>>> 1- When task nodes fail, will the job leader detect this and ssh and
>>>> restart the task node? From my testing it doesn't seem like it.
>>>> 2- How do we recover a lost node? Do we simply go back to the master
>>>> node and run start-cluster.sh and the script is smart enough to figure out
>>>> what is missing?
>>>> 3- Or do we need to create systemd services and if so on which command
>>>> do we start the service on?