It seems like the JobManager is treating this as a job failure. A FAILED JobStatus is a globally terminal state, so with ZooKeeper HA the JobManager cleans up its HA data on shutdown: the completed-checkpoint references and the checkpoint ID counter are removed from ZooKeeper.

https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/JobStatus.java#L39
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L263
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCheckpointIDCounter.java#L108
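For illustration only, below is a minimal, self-contained Java sketch of the pattern those linked classes follow: on shutdown, the ZooKeeper entries are only deleted when the final JobStatus is globally terminal (which FAILED is), while a non-globally-terminal state such as SUSPENDED keeps them for recovery. Everything in this sketch except the idea of JobStatus#isGloballyTerminalState is made-up naming, not Flink's actual code.

    // Sketch of the cleanup gating described above; names are illustrative.
    public class HaCleanupSketch {

        // Simplified stand-in for Flink's JobStatus enum.
        enum JobStatus {
            RUNNING(false),
            SUSPENDED(false),   // not globally terminal: HA data is kept for recovery
            CANCELED(true),
            FINISHED(true),
            FAILED(true);       // globally terminal: HA data is cleaned up

            private final boolean globallyTerminal;

            JobStatus(boolean globallyTerminal) {
                this.globallyTerminal = globallyTerminal;
            }

            boolean isGloballyTerminalState() {
                return globallyTerminal;
            }
        }

        // Stand-in for the shutdown path of the ZooKeeper-backed
        // completed checkpoint store / checkpoint ID counter.
        static void shutdownCheckpointStore(JobStatus finalStatus) {
            if (finalStatus.isGloballyTerminalState()) {
                // e.g. delete /flink/<cluster-id>/checkpoints/<job-id> and
                // /checkpoint-counter/<job-id>, as seen in the log below
                System.out.println("Removing checkpoint references from ZooKeeper");
            } else {
                // keep the ZooKeeper entries so a new leader can recover the job
                System.out.println("Keeping checkpoint references in ZooKeeper");
            }
        }

        public static void main(String[] args) {
            shutdownCheckpointStore(JobStatus.FAILED);    // cleans up ZooKeeper state
            shutdownCheckpointStore(JobStatus.SUSPENDED); // keeps ZooKeeper state
        }
    }

This is why the checkpoint data itself stays on S3 ("Checkpoint with ID 11 ... not discarded" in the log below) but the pointer to it in ZooKeeper is gone, so a restarted cluster has nothing to resume from.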
On Sat, Jun 6, 2020 at 4:38 PM Kathula, Sandeep <sandeep_kath...@intuit.com.invalid> wrote:

> Hi,
> We are running Flink 1.9 in K8S. We used
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/jobmanager_high_availability.html
> to set up high availability. We have a single master. We set the max number of
> retries for a task to 2. After the task fails twice, the job manager fails.
> This is expected. But it is removing the checkpoint from ZooKeeper. As a
> result, on restart it is not consuming from the previous checkpoint. We are
> losing the data.
>
> Logs:
>
> 2020/06/06 19:39:07.759 INFO o.a.f.r.c.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
> 2020/06/06 19:39:07.759 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Shutting down
> 2020/06/06 19:39:07.823 INFO o.a.f.r.z.ZooKeeperStateHandleStore - Removing /flink/sessionization_test4/checkpoints/00000000000000000000000000000000 from ZooKeeper
> 2020/06/06 19:39:07.823 INFO o.a.f.r.c.CompletedCheckpoint - Checkpoint with ID 11 at 's3://s3_bucket/sessionization_test/checkpoints/00000000000000000000000000000000/chk-11' not discarded.
> 2020/06/06 19:39:07.829 INFO o.a.f.r.c.ZooKeeperCheckpointIDCounter - Shutting down.
> 2020/06/06 19:39:07.829 INFO o.a.f.r.c.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
> 2020/06/06 19:39:07.852 INFO o.a.f.r.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state FAILED.
> 2020/06/06 19:39:07.852 INFO o.a.f.runtime.jobmaster.JobMaster - Stopping the JobMaster for job sppstandardresourcemanager-flink-0606193838-6d7dae7e(00000000000000000000000000000000).
> 2020/06/06 19:39:07.854 INFO o.a.f.r.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics null.
> 2020/06/06 19:39:07.854 INFO o.a.f.r.j.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
> 2020/06/06 19:39:07.856 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2020/06/06 19:39:07.859 INFO o.a.f.r.j.slotpool.SlotPoolImpl - Suspending SlotPool.
> 2020/06/06 19:39:07.859 INFO o.a.f.runtime.jobmaster.JobMaster - Close ResourceManager connection d28e9b9e1fc1ba78c2ed010070518057: JobManager is shutting down..
> 2020/06/06 19:39:07.859 INFO o.a.f.r.j.slotpool.SlotPoolImpl - Stopping SlotPool.
> 2020/06/06 19:39:07.859 INFO o.a.f.r.r.StandaloneResourceManager - Disconnect job manager afae482ff82bdb26fe275174c14d4...@akka.tcp://flink@flink-job-cluster:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
> 2020/06/06 19:39:07.860 INFO o.a.f.r.l.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
> 2020/06/06 19:39:07.868 INFO o.a.f.r.j.MiniDispatcherRestEndpoint - Removing cache directory /tmp/flink-web-ef940924-348b-461c-ab53-255a914ed43a/flink-web-ui
> 2020/06/06 19:39:07.870 INFO o.a.f.r.l.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2020/06/06 19:39:07.870 INFO o.a.f.r.j.MiniDispatcherRestEndpoint - Shut down complete.
> 2020/06/06 19:39:07.870 INFO o.a.f.r.r.StandaloneResourceManager - Shut down cluster because application is in FAILED, diagnostics null.
> 2020/06/06 19:39:07.870 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.dispatcher.MiniDispatcher - Stopping dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.dispatcher.MiniDispatcher - Stopping all currently running jobs of dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.r.s.SlotManagerImpl - Closing the SlotManager.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.r.s.SlotManagerImpl - Suspending the SlotManager.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.l.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2020/06/06 19:39:07.871 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/00000000000000000000000000000000/job_manager_lock.
> 2020/06/06 19:39:07.974 INFO o.a.f.r.r.h.l.b.StackTraceSampleCoordinator - Shutting down stack trace sample coordinator.
> 2020/06/06 19:39:07.975 INFO o.a.f.r.l.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2020/06/06 19:39:07.975 INFO o.a.f.r.dispatcher.MiniDispatcher - Stopped dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
> 2020/06/06 19:39:07.975 INFO o.a.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
> 2020/06/06 19:39:07.975 INFO o.a.f.r.h.z.ZooKeeperHaServices - Close and clean up all data for ZooKeeperHaServices.
> 2020/06/06 19:39:08.085 INFO o.a.f.s.c.o.a.c.f.i.CuratorFrameworkImpl - backgroundOperationsLoop exiting
> 2020/06/06 19:39:08.090 INFO o.a.f.s.z.o.a.zookeeper.ClientCnxn - EventThread shut down for session: 0x17282452e8c0823
> 2020/06/06 19:39:08.090 INFO o.a.f.s.z.o.a.zookeeper.ZooKeeper - Session: 0x17282452e8c0823 closed
> 2020/06/06 19:39:08.091 INFO o.a.f.r.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
> 2020/06/06 19:39:08.093 INFO o.a.f.r.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
> 2020/06/06 19:39:08.096 INFO a.r.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
> 2020/06/06 19:39:08.097 INFO a.r.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
> 2020/06/06 19:39:08.099 INFO a.r.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
> 2020/06/06 19:39:08.099 INFO a.r.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
> 2020/06/06 19:39:08.108 INFO a.r.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
> 2020/06/06 19:39:08.114 INFO a.r.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
> 2020/06/06 19:39:08.123 INFO o.a.f.r.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
> 2020/06/06 19:39:08.124 INFO o.a.f.r.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.
>
> Can you please help?
>
> Thanks
> Sandeep Kathula