Till,

I will send you the complete log offline. We don't know how to reliably
reproduce the problem. but it did happen quite frequently, like once every
a couple of days. Let me see if I can cherry pick the fix/commit to 1.7
branch.

Thanks,
Steven


On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Steven,
>
> is this the tail of the logs or are there other statements following? I
> think your problem could indeed be related to FLINK-11537. Is it possible
> to somehow reliably reproduce this problem? If yes, then you could try out
> the RC for Flink 1.8.0 which should be published in the next days.
>
> Cheers,
> Till
>
> On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> We have observe that sometimes job stuck in suspended state, and no job
>> restart/recover were attempted once job is suspended.
>> * it is a high-parallelism job (like close to 2,000)
>> * there were a few job restarts before this
>> * there were high GC pause during the period
>> * zookeeper timeout. probably caused by high GC pause
>>
>> Is it related to https://issues.apache.org/jira/browse/FLINK-11537?
>>
>> I pasted some logs in the end.
>>
>> Thanks,
>> Steven
>>
>> 2019-02-28 19:04:36,357 WARN
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>> configuration section named 'Client' was found in speci
>> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will
>> continue connection to Zookeeper server without SASL authentication, if
>> Zookeeper server allows it.
>> 2019-02-28 19:04:36,357 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Opening socket connection to server 100.82.141.106/100.82.141.106:2181
>> 2019-02-28 19:04:36,357 ERROR
>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Authentication failed
>> 2019-02-28 19:04:36,357 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>> connection established to 100.82.141.106/100.82.141.106:2181, initiating
>> session
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Session establishment complete on server
>> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2,
>> negotiated timeout = 40000
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>> - State change: RECONNECTED
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>> monitored again.
>> 2019-02-28 19:04:36,360 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
>> ...
>> 2019-02-28 19:05:09,400 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> 2019-02-28 19:05:09,400 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>> RESTARTING to SUSPENDING.
>> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
>>         at
>> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
>>         at
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>> 2019-02-28 19:05:09,403 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>> SUSPENDING to SUSPENDED.
>> 2019-02-28 19:05:09,403 INFO
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
>> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
>> 2019-02-28 19:05:09,403 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>> Suspending
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Shutting down.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
>> ...
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is
>> no longer the leader..
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
>> SlotPool.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
>> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
>> 2019-02-28 19:05:09,449 INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
>> SlotPool.
>> 2019-02-28 19:05:09,449 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Stopping ZooKeeperLeaderElectionService
>> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
>> 2019-02-28 19:05:09,452 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
>>
>

Reply via email to