Till, I will send you the complete log offline. We don't know how to reliably reproduce the problem. but it did happen quite frequently, like once every a couple of days. Let me see if I can cherry pick the fix/commit to 1.7 branch.
Thanks, Steven On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <trohrm...@apache.org> wrote: > Hi Steven, > > is this the tail of the logs or are there other statements following? I > think your problem could indeed be related to FLINK-11537. Is it possible > to somehow reliably reproduce this problem? If yes, then you could try out > the RC for Flink 1.8.0 which should be published in the next days. > > Cheers, > Till > > On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <stevenz...@gmail.com> wrote: > >> We have observe that sometimes job stuck in suspended state, and no job >> restart/recover were attempted once job is suspended. >> * it is a high-parallelism job (like close to 2,000) >> * there were a few job restarts before this >> * there were high GC pause during the period >> * zookeeper timeout. probably caused by high GC pause >> >> Is it related to https://issues.apache.org/jira/browse/FLINK-11537? >> >> I pasted some logs in the end. >> >> Thanks, >> Steven >> >> 2019-02-28 19:04:36,357 WARN >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL >> configuration failed: javax.security.auth.login.LoginException: No JAAS >> configuration section named 'Client' was found in speci >> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will >> continue connection to Zookeeper server without SASL authentication, if >> Zookeeper server allows it. >> 2019-02-28 19:04:36,357 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Opening socket connection to server 100.82.141.106/100.82.141.106:2181 >> 2019-02-28 19:04:36,357 ERROR >> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - >> Authentication failed >> 2019-02-28 19:04:36,357 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket >> connection established to 100.82.141.106/100.82.141.106:2181, initiating >> session >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Session establishment complete on server >> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2, >> negotiated timeout = 40000 >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager >> - State change: RECONNECTED >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - >> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - >> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted. >> 2019-02-28 19:04:36,359 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are >> monitored again. >> 2019-02-28 19:04:36,360 INFO >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - >> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte >> ... >> 2019-02-28 19:05:09,400 INFO >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - >> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. >> 2019-02-28 19:05:09,400 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job >> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state >> RESTARTING to SUSPENDING. >> org.apache.flink.util.FlinkException: JobManager is no longer the leader. >> at >> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371) >> at >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) >> at >> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) >> at >> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) >> at >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >> at java.lang.Thread.run(Thread.java:748) >> 2019-02-28 19:05:09,403 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job >> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state >> SUSPENDING to SUSPENDED. >> 2019-02-28 19:05:09,403 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping >> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0. >> 2019-02-28 19:05:09,403 INFO >> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - >> Suspending >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - >> Shutting down. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job >> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended. >> ... >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.jobmaster.JobMaster - Close >> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is >> no longer the leader.. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending >> SlotPool. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc >> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started >> yet. Discarding message >> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing >> is started. >> 2019-02-28 19:05:09,448 INFO >> org.apache.flink.runtime.jobmaster.JobMaster - Stopping >> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0). >> 2019-02-28 19:05:09,449 INFO >> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping >> SlotPool. >> 2019-02-28 19:05:09,449 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Stopping ZooKeeperLeaderElectionService >> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}. >> 2019-02-28 19:05:09,452 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper. >> >