We have observed that a job sometimes gets stuck in the SUSPENDED state, and
no job restart or recovery is attempted once the job is suspended. Some context:
* it is a high-parallelism job (close to 2,000)
* there were a few job restarts before this
* there were long GC pauses during the period
* there was a ZooKeeper timeout, probably caused by the long GC pauses

Is it related to https://issues.apache.org/jira/browse/FLINK-11537?

I have pasted some logs at the end.

Thanks,
Steven

2019-02-28 19:04:36,357 WARN
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified
JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will
continue connection to Zookeeper server without SASL authentication, if
Zookeeper server allows it.
2019-02-28 19:04:36,357 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server 100.82.141.106/100.82.141.106:2181
2019-02-28 19:04:36,357 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2019-02-28 19:04:36,357 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to 100.82.141.106/100.82.141.106:2181, initiating
session
2019-02-28 19:04:36,359 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server 100.82.141.106/100.82.141.106:2181,
sessionid = 0x365ef9c4fe7f1f2, negotiated timeout = 40000
2019-02-28 19:04:36,359 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: RECONNECTED
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
monitored again.
2019-02-28 19:04:36,360 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
...
2019-02-28 19:05:09,400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-02-28 19:05:09,400 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
RESTARTING to SUSPENDING.
org.apache.flink.util.FlinkException: JobManager is no longer the leader.
        at
org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
        at
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
        at
org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
        at
org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-02-28 19:05:09,403 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
SUSPENDING to SUSPENDED.
2019-02-28 19:05:09,403 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
2019-02-28 19:05:09,403 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
Suspending
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
Shutting down.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
...
2019-02-28 19:05:09,448 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Close ResourceManager connection
9db2027a0a32f2a44744a0d4a0f84b87: JobManager is no longer the leader..
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
SlotPool.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Stopping the JobMaster for job
cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
2019-02-28 19:05:09,449 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
SlotPool.
2019-02-28 19:05:09,449 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
2019-02-28 19:05:09,452 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
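
For anyone who wants to check their own JobManager logs for the same pattern, the following is a minimal sketch that lists the ExecutionGraph state transitions of a job. It is not part of Flink; the function name job_state_transitions and the regular expressions are my own assumptions, based only on the message format visible in the logs above. It re-joins lines that were wrapped by the mail client before matching.

```python
import re

# Assumed message format, taken from the ExecutionGraph lines in the logs above:
#   "Job <name> (<32-hex job id>) switched from state <OLD> to <NEW>."
TRANSITION = re.compile(
    r"Job\s+(?P<name>\S+)\s+\((?P<job_id>[0-9a-f]{32})\)\s+"
    r"switched from state\s+(?P<old>[A-Z]+)\s+to\s+(?P<new>[A-Z]+)\."
)
# A log record starts with a timestamp such as "2019-02-28 19:05:09,400".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def job_state_transitions(log_text):
    """Return (timestamp, job_id, old_state, new_state) tuples in log order."""
    # Re-join wrapped lines: every line that does not start with a timestamp
    # is treated as a continuation of the previous record.
    records = []
    for line in log_text.splitlines():
        if TIMESTAMP.match(line):
            records.append(line.strip())
        elif records:
            records[-1] += " " + line.strip()

    transitions = []
    for record in records:
        match = TRANSITION.search(record)
        if match:
            timestamp = TIMESTAMP.match(record).group(0)
            transitions.append(
                (timestamp, match.group("job_id"), match.group("old"), match.group("new"))
            )
    return transitions
```

A job whose last reported transition is SUSPENDING to SUSPENDED, with no later transition for the same job ID, would match the situation described in this mail.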
