The state backend can be set to HashMap or RocksDB. Checkpoint storage must be a shared file system (NFS, HDFS, or something similar), accessible from both the job manager and the task managers.
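
For reference, here's a minimal sketch of that setup with the native DataStream API (your Beam pipeline passes the equivalent options, e.g. checkpointingInterval / checkpointingMode). It's only an illustration: the hdfs:///flink/checkpoints path is a placeholder for whatever shared location your cluster actually uses, and the RocksDB backend needs the flink-statebackend-rocksdb dependency.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state lives on each TaskManager: on the JVM heap (HashMap)
        // or in an embedded RocksDB instance on local disk.
        env.setStateBackend(new EmbeddedRocksDBStateBackend()); // or: new HashMapStateBackend()

        // Checkpoint snapshots must go to storage that the JobManager and
        // every TaskManager can reach (HDFS, NFS mount, object store, ...).
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints"); // placeholder path

        // ... sources, transformations, sinks ...
        // env.execute("my-job");
    }
}

A similar sketch of the restart-strategy / heartbeat.timeout settings from the thread is at the bottom of this mail.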
On Wed, Mar 2, 2022 at 05:55, Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:

> Hi,
>
> I'm trying to understand the guidelines for task manager recovery.
> From what I see in the documentation, the state backend can be set as in memory / file system / rocksdb, and the checkpoint storage requires a shared file system for both file system and rocksdb. Is that correct? Must the file system be shared between the task managers and job managers? Is there another option?
>
> Thanks,
> Ifat
>
> From: Zhilong Hong <zhlongh...@gmail.com>
> Date: Thursday, 24 February 2022 at 19:58
> To: "Afek, Ifat (Nokia - IL/Kfar Sava)" <ifat.a...@nokia.com>
> Cc: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink job recovery after task manager failure
>
> Hi, Afek
>
> I've read the log you provided. Since you've set the value of restart-strategy to exponential-delay and the value of restart-strategy.exponential-delay.initial-backoff is 10s, every time a failover is triggered the JobManager has to wait for 10 seconds before it restarts the job. If you'd prefer a quicker restart, you could change the restart strategy to fixed-delay and set a small value for restart-strategy.fixed-delay.delay.
>
> Furthermore, there are two more failovers that happened during the initialization of the recovered tasks. During initialization, a task tries to recover its state from the last valid checkpoint. A FileNotFound exception happens during the recovery process. I'm not quite sure of the reason. Since the recovery succeeds after two failovers, I think maybe it's because the local disks of your cluster are not stable.
>
> Sincerely,
> Zhilong
>
> On Thu, Feb 24, 2022 at 9:04 PM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:
>
> Thanks Zhilong.
>
> The first launch of our job is fast, I don't think that's the issue. I see in the Flink JobManager log that there were several exceptions during the restart, and the task manager was restarted a few times until it stabilized.
>
> You can find the log here:
> jobmanager-log.txt.gz <https://nokia-my.sharepoint.com/:u:/p/ifat_afek/EUsu4rb_-BpNrkpvSwzI-vgBtBO9OQlIm0CHtW0gsZ7Gqg?email=zhlonghong%40gmail.com&e=ww5Idt>
>
> Thanks,
> Ifat
>
> From: Zhilong Hong <zhlongh...@gmail.com>
> Date: Wednesday, 23 February 2022 at 19:38
> To: "Afek, Ifat (Nokia - IL/Kfar Sava)" <ifat.a...@nokia.com>
> Cc: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Flink job recovery after task manager failure
>
> Hi, Afek!
>
> When a TaskManager is killed, the JobManager will not notice until a heartbeat timeout happens. Currently, the default value of heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30 seconds for Flink to trigger a failover. If you'd like to shorten the time until a failover is triggered in this situation, you could decrease the value of heartbeat.timeout in flink-conf.yaml. However, if the value is set too small, heartbeat timeouts will happen more frequently and the cluster will be unstable. As FLINK-23403 [2] mentions, if you are using Flink 1.14 or 1.15, you could try setting the value to 10s.
>
> You mentioned that it takes 5-6 minutes to restart the jobs. That seems a bit odd. How long does it take to deploy your job for a brand new launch? You could compress and upload the JobManager log to Google Drive or OneDrive and attach the sharing link. Maybe we can find out what happens via the log.
>
> Sincerely,
> Zhilong
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
> [2] https://issues.apache.org/jira/browse/FLINK-23403
>
> On Thu, Feb 24, 2022 at 12:25 AM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.a...@nokia.com> wrote:
>
> Hi,
>
> I am trying to use the Flink checkpoints solution in order to support task manager recovery.
> I'm running Flink using Beam with filesystem storage and the following parameters:
> checkpointingInterval=30000
> checkpointingMode=EXACTLY_ONCE
>
> What I see is that if I kill a task manager pod, it takes Flink about 30 seconds to identify the failure and another 5-6 minutes to restart the jobs.
> Is there a way to shorten the downtime? What is the expected downtime in case the task manager is killed, until the jobs are recovered? Are there any best practices for handling it? (e.g. different configuration parameters)
>
> Thanks,
> Ifat
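
As a footnote on the failover timing discussed above: the settings Zhilong mentioned can be sketched roughly as below. On a deployed cluster they belong in flink-conf.yaml; the concrete values here (10 attempts, 1 s restart delay, 10 s heartbeat timeout) are only examples to experiment with, not recommendations.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailoverTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Restart promptly after a failure instead of backing off exponentially.
        conf.setString("restart-strategy", "fixed-delay");
        conf.setString("restart-strategy.fixed-delay.attempts", "10");
        conf.setString("restart-strategy.fixed-delay.delay", "1 s");

        // Detect a killed TaskManager sooner than the 50 s default; values that
        // are too small make the cluster unstable (see FLINK-23403).
        conf.setString("heartbeat.timeout", "10000"); // milliseconds
        conf.setString("heartbeat.interval", "5000");

        // Honored like this only for local/embedded execution; on a real
        // cluster, put the same keys into flink-conf.yaml instead.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}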