Hi Steven, is this the tail of the logs or are there other statements following? I think your problem could indeed be related to FLINK-11537. Is it possible to somehow reliably reproduce this problem? If yes, then you could try out the RC for Flink 1.8.0 which should be published in the next days.
Cheers, Till On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <stevenz...@gmail.com> wrote: > We have observe that sometimes job stuck in suspended state, and no job > restart/recover were attempted once job is suspended. > * it is a high-parallelism job (like close to 2,000) > * there were a few job restarts before this > * there were high GC pause during the period > * zookeeper timeout. probably caused by high GC pause > > Is it related to https://issues.apache.org/jira/browse/FLINK-11537? > > I pasted some logs in the end. > > Thanks, > Steven > > 2019-02-28 19:04:36,357 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in speci > fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will > continue connection to Zookeeper server without SASL authentication, if > Zookeeper server allows it. > 2019-02-28 19:04:36,357 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - > Opening socket connection to server 100.82.141.106/100.82.141.106:2181 > 2019-02-28 19:04:36,357 ERROR > org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - > Authentication failed > 2019-02-28 19:04:36,357 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to 100.82.141.106/100.82.141.106:2181, initiating > session > 2019-02-28 19:04:36,359 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - > Session establishment complete on server > 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2, > negotiated timeout = 40000 > 2019-02-28 19:04:36,359 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: RECONNECTED > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper was reconnected. Leader election can be restarted. > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper was reconnected. Leader election can be restarted. > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper was reconnected. Leader election can be restarted. > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper was reconnected. Leader election can be restarted. > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. > 2019-02-28 19:04:36,359 INFO > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are > monitored again. > 2019-02-28 19:04:36,360 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper was reconnected. Leader retrieval can be restarte > ... > 2019-02-28 19:05:09,400 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. > 2019-02-28 19:05:09,400 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state > RESTARTING to SUSPENDING. > org.apache.flink.util.FlinkException: JobManager is no longer the leader. > at > org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371) > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) > at > org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) > at > org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) > at > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2019-02-28 19:05:09,403 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state > SUSPENDING to SUSPENDED. > 2019-02-28 19:05:09,403 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping > checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0. > 2019-02-28 19:05:09,403 INFO > org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - > Suspending > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - > Shutting down. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > 0e594c065c7f8319a12fa47e089ca9b0 has been suspended. > ... > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.jobmaster.JobMaster - Close > ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is > no longer the leader.. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending > SlotPool. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc > endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started > yet. Discarding message > org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing > is started. > 2019-02-28 19:05:09,448 INFO > org.apache.flink.runtime.jobmaster.JobMaster - Stopping > the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0). > 2019-02-28 19:05:09,449 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping > SlotPool. > 2019-02-28 19:05:09,449 INFO > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Stopping ZooKeeperLeaderElectionService > ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}. > 2019-02-28 19:05:09,452 INFO > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper. >