The state backend can be set to HashMap or RocksDB. Checkpoint storage must be a shared file system (NFS, HDFS, or something similar), accessible from both the job manager and the task managers.
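
For reference, here's a minimal sketch of that setup with the native DataStream API (your Beam pipeline passes the equivalent options, e.g. checkpointingInterval / checkpointingMode). It's only an illustration: the hdfs:///flink/checkpoints path is a placeholder for whatever shared location your cluster actually uses, and the RocksDB backend needs the flink-statebackend-rocksdb dependency.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state lives on each TaskManager: on the JVM heap (HashMap)
        // or in an embedded RocksDB instance on local disk.
        env.setStateBackend(new EmbeddedRocksDBStateBackend()); // or: new HashMapStateBackend()

        // Checkpoint snapshots must go to storage that the JobManager and
        // every TaskManager can reach (HDFS, NFS mount, object store, ...).
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints"); // placeholder path

        // ... sources, transformations, sinks ...
        // env.execute("my-job");
    }
}

A similar sketch of the restart-strategy / heartbeat.timeout settings from the thread is at the bottom of this mail.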
On Wed, Mar 2, 2022 at 05:55, Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:

> Hi,
>
> I'm trying to understand the guidelines for task manager recovery.
> From what I see in the documentation, the state backend can be set as in memory / file system / rocksdb, and the checkpoint storage requires a shared file system for both file system and rocksdb. Is that correct? Must the file system be shared between the task managers and job managers? Is there another option?
>
> Thanks,
> Ifat
>
> From: Zhilong Hong <zhlongh...@gmail.com>
> Date: Thursday, 24 February 2022 at 19:58
> To: "Afek, Ifat (Nokia - IL/Kfar Sava)" <ifat.a...@nokia.com>
> Cc: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink job recovery after task manager failure
>
> Hi, Afek
>
> I've read the log you provided. Since you've set the value of restart-strategy to exponential-delay and the value of restart-strategy.exponential-delay.initial-backoff is 10s, every time a failover is triggered the JobManager has to wait for 10 seconds before it restarts the job. If you'd prefer a quicker restart, you could change the restart strategy to fixed-delay and set a small value for restart-strategy.fixed-delay.delay.
>
> Furthermore, there are two more failovers that happened during the initialization of the recovered tasks. During initialization, a task tries to recover its state from the last valid checkpoint. A FileNotFound exception happens during the recovery process. I'm not quite sure of the reason. Since the recovery succeeds after two failovers, I think maybe it's because the local disks of your cluster are not stable.
>
> Sincerely,
> Zhilong
>
> On Thu, Feb 24, 2022 at 9:04 PM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:
>
> Thanks Zhilong.
>
> The first launch of our job is fast, I don't think that's the issue. I see in the Flink JobManager log that there were several exceptions during the restart, and the task manager was restarted a few times until it stabilized.
>
> You can find the log here:
> jobmanager-log.txt.gz <https://nokia-my.sharepoint.com/:u:/p/ifat_afek/EUsu4rb_-BpNrkpvSwzI-vgBtBO9OQlIm0CHtW0gsZ7Gqg?email=zhlonghong%40gmail.com&e=ww5Idt>
>
> Thanks,
> Ifat
>
> From: Zhilong Hong <zhlongh...@gmail.com>
> Date: Wednesday, 23 February 2022 at 19:38
> To: "Afek, Ifat (Nokia - IL/Kfar Sava)" <ifat.a...@nokia.com>
> Cc: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink job recovery after task manager failure
>
> Hi, Afek!
>
> When a TaskManager is killed, the JobManager will not notice until a heartbeat timeout happens. Currently, the default value of heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30 seconds for Flink to trigger a failover. If you'd like to shorten the time until a failover is triggered in this situation, you could decrease the value of heartbeat.timeout in flink-conf.yaml. However, if the value is set too small, heartbeat timeouts will happen more frequently and the cluster will be unstable. As FLINK-23403 [2] mentions, if you are using Flink 1.14 or 1.15, you could try setting the value to 10s.
>
> You mentioned that it takes 5-6 minutes to restart the jobs. That seems a bit odd. How long does it take to deploy your job for a brand new launch? You could compress and upload the JobManager log to Google Drive or OneDrive and attach the sharing link. Maybe we can find out what happens via the log.
>
> Sincerely,
> Zhilong
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
> [2] https://issues.apache.org/jira/browse/FLINK-23403
>
> On Thu, Feb 24, 2022 at 12:25 AM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:
>
> Hi,
>
> I am trying to use the Flink checkpoints solution in order to support task manager recovery.
> I'm running Flink using Beam with filesystem storage and the following parameters:
> checkpointingInterval=30000
> checkpointingMode=EXACTLY_ONCE
>
> What I see is that if I kill a task manager pod, it takes Flink about 30 seconds to identify the failure and another 5-6 minutes to restart the jobs.
> Is there a way to shorten the downtime? What is the expected downtime in case the task manager is killed, until the jobs are recovered? Are there any best practices for handling it? (e.g. different configuration parameters)
>
> Thanks,
> Ifat
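
As a footnote on the failover timing discussed above: the settings Zhilong mentioned can be sketched roughly as below. On a deployed cluster they belong in flink-conf.yaml; the concrete values here (10 attempts, 1 s restart delay, 10 s heartbeat timeout) are only examples to experiment with, not recommendations.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailoverTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Restart promptly after a failure instead of backing off exponentially.
        conf.setString("restart-strategy", "fixed-delay");
        conf.setString("restart-strategy.fixed-delay.attempts", "10");
        conf.setString("restart-strategy.fixed-delay.delay", "1 s");

        // Detect a killed TaskManager sooner than the 50 s default; values that
        // are too small make the cluster unstable (see FLINK-23403).
        conf.setString("heartbeat.timeout", "10000"); // milliseconds
        conf.setString("heartbeat.interval", "5000");

        // Honored like this only for local/embedded execution; on a real
        // cluster, put the same keys into flink-conf.yaml instead.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}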