We have observe that sometimes job stuck in suspended state, and no job restart/recover were attempted once job is suspended. * it is a high-parallelism job (like close to 2,000) * there were a few job restarts before this * there were high GC pause during the period * zookeeper timeout. probably caused by high GC pause
Is it related to https://issues.apache.org/jira/browse/FLINK-11537? I pasted some logs in the end. Thanks, Steven 2019-02-28 19:04:36,357 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in speci fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 2019-02-28 19:04:36,357 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 100.82.141.106/100.82.141.106:2181 2019-02-28 19:04:36,357 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed 2019-02-28 19:04:36,357 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 100.82.141.106/100.82.141.106:2181, initiating session 2019-02-28 19:04:36,359 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2, negotiated timeout = 40000 2019-02-28 19:04:36,359 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted. 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted. 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted. 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted. 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. 2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again. 2019-02-28 19:04:36,360 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarte ... 2019-02-28 19:05:09,400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. 2019-02-28 19:05:09,400 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state RESTARTING to SUSPENDING. org.apache.flink.util.FlinkException: JobManager is no longer the leader. at org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371) at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247) at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640) at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636) at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635) at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623) at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64) at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82) at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259) at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255) at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253) at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state SUSPENDING to SUSPENDED. 2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0. 2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Suspending 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job 0e594c065c7f8319a12fa47e089ca9b0 has been suspended. ... 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is no longer the leader.. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started. 2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0). 2019-02-28 19:05:09,449 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool. 2019-02-28 19:05:09,449 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}. 2019-02-28 19:05:09,452 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.