Hi,
We are running Flink 1.9 on K8S. We followed
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/jobmanager_high_availability.html
to set up high availability. We have a single master. We set the maximum number
of retries for a task to 2. After the task fails twice, the job manager fails.
This is expected. However, it also removes the checkpoint from ZooKeeper. As a
result, on restart the job does not resume from the previous checkpoint, and we
are losing data.
Logs:
2020/06/06 19:39:07.759 INFO o.a.f.r.c.CheckpointCoordinator - Stopping
checkpoint coordinator for job 00000000000000000000000000000000.
2020/06/06 19:39:07.759 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
Shutting down
2020/06/06 19:39:07.823 INFO o.a.f.r.z.ZooKeeperStateHandleStore - Removing
/flink/sessionization_test4/checkpoints/00000000000000000000000000000000 from
ZooKeeper
2020/06/06 19:39:07.823 INFO o.a.f.r.c.CompletedCheckpoint - Checkpoint with
ID 11 at
's3://s3_bucket/sessionization_test/checkpoints/00000000000000000000000000000000/chk-11'
not discarded.
2020/06/06 19:39:07.829 INFO o.a.f.r.c.ZooKeeperCheckpointIDCounter - Shutting
down.
2020/06/06 19:39:07.829 INFO o.a.f.r.c.ZooKeeperCheckpointIDCounter - Removing
/checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
2020/06/06 19:39:07.852 INFO o.a.f.r.dispatcher.MiniDispatcher - Job
00000000000000000000000000000000 reached globally terminal state FAILED.
2020/06/06 19:39:07.852 INFO o.a.f.runtime.jobmaster.JobMaster - Stopping the
JobMaster for job
sppstandardresourcemanager-flink-0606193838-6d7dae7e(00000000000000000000000000000000).
2020/06/06 19:39:07.854 INFO o.a.f.r.entrypoint.ClusterEntrypoint - Shutting
StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics
null.
2020/06/06 19:39:07.854 INFO o.a.f.r.j.MiniDispatcherRestEndpoint - Shutting
down rest endpoint.
2020/06/06 19:39:07.856 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020/06/06 19:39:07.859 INFO o.a.f.r.j.slotpool.SlotPoolImpl - Suspending
SlotPool.
2020/06/06 19:39:07.859 INFO o.a.f.runtime.jobmaster.JobMaster - Close
ResourceManager connection d28e9b9e1fc1ba78c2ed010070518057: JobManager is
shutting down..
2020/06/06 19:39:07.859 INFO o.a.f.r.j.slotpool.SlotPoolImpl - Stopping
SlotPool.
2020/06/06 19:39:07.859 INFO o.a.f.r.r.StandaloneResourceManager - Disconnect
job manager
akka.tcp://flink@flink-job-cluster:6123/user/jobmanager_0
for job 00000000000000000000000000000000 from the resource manager.
2020/06/06 19:39:07.860 INFO o.a.f.r.l.ZooKeeperLeaderElectionService -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
2020/06/06 19:39:07.868 INFO o.a.f.r.j.MiniDispatcherRestEndpoint - Removing
cache directory /tmp/flink-web-ef940924-348b-461c-ab53-255a914ed43a/flink-web-ui
2020/06/06 19:39:07.870 INFO o.a.f.r.l.ZooKeeperLeaderElectionService -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2020/06/06 19:39:07.870 INFO o.a.f.r.j.MiniDispatcherRestEndpoint - Shut down
complete.
2020/06/06 19:39:07.870 INFO o.a.f.r.r.StandaloneResourceManager - Shut down
cluster because application is in FAILED, diagnostics null.
2020/06/06 19:39:07.870 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2020/06/06 19:39:07.871 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020/06/06 19:39:07.871 INFO o.a.f.r.dispatcher.MiniDispatcher - Stopping
dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
2020/06/06 19:39:07.871 INFO o.a.f.r.dispatcher.MiniDispatcher - Stopping all
currently running jobs of dispatcher
akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
2020/06/06 19:39:07.871 INFO o.a.f.r.r.s.SlotManagerImpl - Closing the
SlotManager.
2020/06/06 19:39:07.871 INFO o.a.f.r.r.s.SlotManagerImpl - Suspending the
SlotManager.
2020/06/06 19:39:07.871 INFO o.a.f.r.l.ZooKeeperLeaderElectionService -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2020/06/06 19:39:07.871 INFO o.a.f.r.l.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService
/leader/00000000000000000000000000000000/job_manager_lock.
2020/06/06 19:39:07.974 INFO o.a.f.r.r.h.l.b.StackTraceSampleCoordinator -
Shutting down stack trace sample coordinator.
2020/06/06 19:39:07.975 INFO o.a.f.r.l.ZooKeeperLeaderElectionService -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2020/06/06 19:39:07.975 INFO o.a.f.r.dispatcher.MiniDispatcher - Stopped
dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
2020/06/06 19:39:07.975 INFO o.a.flink.runtime.blob.BlobServer - Stopped BLOB
server at 0.0.0.0:6124
2020/06/06 19:39:07.975 INFO o.a.f.r.h.z.ZooKeeperHaServices - Close and clean
up all data for ZooKeeperHaServices.
2020/06/06 19:39:08.085 INFO o.a.f.s.c.o.a.c.f.i.CuratorFrameworkImpl -
backgroundOperationsLoop exiting
2020/06/06 19:39:08.090 INFO o.a.f.s.z.o.a.zookeeper.ClientCnxn - EventThread
shut down for session: 0x17282452e8c0823
2020/06/06 19:39:08.090 INFO o.a.f.s.z.o.a.zookeeper.ZooKeeper - Session:
0x17282452e8c0823 closed
2020/06/06 19:39:08.091 INFO o.a.f.r.rpc.akka.AkkaRpcService - Stopping Akka
RPC service.
2020/06/06 19:39:08.093 INFO o.a.f.r.rpc.akka.AkkaRpcService - Stopping Akka
RPC service.
2020/06/06 19:39:08.096 INFO a.r.RemoteActorRefProvider$RemotingTerminator -
Shutting down remote daemon.
2020/06/06 19:39:08.097 INFO a.r.RemoteActorRefProvider$RemotingTerminator -
Remote daemon shut down; proceeding with flushing remote transports.
2020/06/06 19:39:08.099 INFO a.r.RemoteActorRefProvider$RemotingTerminator -
Shutting down remote daemon.
2020/06/06 19:39:08.099 INFO a.r.RemoteActorRefProvider$RemotingTerminator -
Remote daemon shut down; proceeding with flushing remote transports.
2020/06/06 19:39:08.108 INFO a.r.RemoteActorRefProvider$RemotingTerminator -
Remoting shut down.
2020/06/06 19:39:08.114 INFO a.r.RemoteActorRefProvider$RemotingTerminator -
Remoting shut down.
2020/06/06 19:39:08.123 INFO o.a.f.r.rpc.akka.AkkaRpcService - Stopped Akka
RPC service.
2020/06/06 19:39:08.124 INFO o.a.f.r.entrypoint.ClusterEntrypoint -
Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit
code 1443.
Can you please help?
Thanks
Sandeep Kathula