Re: How to restart/recover on reboot?

John Smith Tue, 18 Jun 2019 07:29:27 -0700

Yes, that is understood. But I don't see why we cannot call jobmanager.sh
and taskmanager.sh to build the cluster and have them run as systemd units.


I looked at start-cluster.sh and all it does is SSH and call jobmanager.sh,
which then cascades to taskmanager.sh. I just have to pinpoint what's
missing to have the systemd service working. In fact, calling jobmanager.sh
as a systemd service actually sees the shared masters, slaves and
flink-conf.yaml. But it binds to localhost.

Maybe one way to do it would be to bootstrap the cluster with
./start-cluster.sh and then install systemd services for jobmanager.sh and
taskmanager.sh.
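
A minimal sketch of what such a unit could look like for a TaskManager,
under stated assumptions: the install path /opt/flink, the service user
"flink" and the unit file name are hypothetical, and the unit has not been
tested against this cluster. It relies on the `start-foreground` mode of
taskmanager.sh, which keeps the process attached so that systemd can
supervise and restart it.

```ini
# /etc/systemd/system/flink-taskmanager.service  (hypothetical name and path)
[Unit]
Description=Apache Flink TaskManager (standalone mode)
# Wait for the network so the TaskManager can reach the JobManager.
After=network-online.target
Wants=network-online.target

[Service]
# Assumption: Flink is unpacked in /opt/flink and runs as user "flink".
User=flink
# start-foreground does not daemonize, so systemd tracks the real process.
ExecStart=/opt/flink/bin/taskmanager.sh start-foreground
# Bring the process back after a crash; a reboot is covered by "enable".
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload`, `systemctl enable --now flink-taskmanager`
would start it now and on every boot. A JobManager unit would presumably
look the same with `jobmanager.sh start-foreground`; as far as I can tell
that script also accepts an optional host argument, which may be related to
the localhost binding, but I have not verified that.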

Like I said, I don't want to have some process in place to remind admins
that they need to manually start a node every time they patch or a host
goes down for whatever reason.

On Tue, 18 Jun 2019 at 04:31, Till Rohrmann <trohrm...@apache.org> wrote:

> When a single machine fails you should rather call `taskmanager.sh
> start`/`jobmanager.sh start` to start a single process. `start-cluster.sh`
> will start multiple processes on different machines.
>
> Cheers,
> Till
>
> On Mon, Jun 17, 2019 at 4:30 PM John Smith <java.dev....@gmail.com> wrote:
>
>> Well some reasons, machine reboots/maintenance etc... Host/VM crashes and
>> restarts. And same goes for the job manager. I don't want/need to have to
>> document/remember some start process for sys admins/devops.
>>
>> So far I have looked at ./start-cluster.sh and all it seems to do is SSH
>> into all the specified nodes and start the processes using the jobmanager
>> and taskmanager scripts. I don't see anything special in any of the sh
>> scripts.
>> I configured passwordless SSH through Terraform and all that works great.
>> It is only when trying to do the manual start through systemd that I may
>> have something missing...
>>
>>
>>
>> On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hi John,
>>>
>>> I have not much experience wrt setting Flink up via systemd services.
>>> Why do you want to do it like that?
>>>
>>> 1. In standalone mode, Flink won't automatically restart TaskManagers.
>>> This only works on Yarn and Mesos atm.
>>> 2. In case of a lost TaskManager, you should run `taskmanager.sh start`.
>>> This script simply starts a new TaskManager process.
>>> 3. I guess you could use systemd to bring up a Flink TaskManager process
>>> on start up.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Jun 14, 2019 at 5:56 PM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> I looked into start-cluster.sh and I don't see anything special. So
>>>> technically it should be as easy as installing systemd services to run
>>>> jobmanager.sh and taskmanager.sh respectively?
>>>>
>>>> On Wed, 12 Jun 2019 at 13:02, John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> The installation instructions do not indicate how to create systemd
>>>>> services.
>>>>>
>>>>> 1- When task nodes fail, will the job leader detect this and ssh and
>>>>> restart the task node? From my testing it doesn't seem like it.
>>>>> 2- How do we recover a lost node? Do we simply go back to the master
>>>>> node and run start-cluster.sh and the script is smart enough to figure out
>>>>> what is missing?
>>>>> 3- Or do we need to create systemd services and, if so, which command
>>>>> should the service run?
>>>>>
>>>>
