Re: [1.7.1] job stuck in suspended state

Till Rohrmann Wed, 06 Mar 2019 07:35:12 -0800

Hi Steven,

I think I found the problem. It is caused by a JobMaster which takes a long
time to suspend the job and multiple leader changes. So what happens after
the first leadership revoking and regaining is that the Dispatcher recovers
the submitted job but waits to execute it because the JobMaster from the
previous leader session has not been stopped yet. This is done by
scheduling some futures with the current main thread executor. Now what
happens next is another cycle of leadership revoking and regaining (3rd
time that the Dispatcher receives the leadership). Due to this, we create a
new main thread executor and invalidate the previous one. As a consequence
the created future from the 2nd leadership will never be executed. This
alone is not a problem but since we serialize the leadership gaining
operations, we actually wait for this future to complete before starting
the job recovery belonging to the 3rd leadership.


Long story short: FLINK-11537 should make this problem very unlikely to
occur because the JobMaster termination is now almost instantaneous.
However it does not fully fix the underlying problem. Here is the JIRA
issue for fixing the problem:
https://issues.apache.org/jira/browse/FLINK-11843

Cheers,
Till

On Wed, Mar 6, 2019 at 1:21 PM Till Rohrmann <[email protected]> wrote:

> Hi Steven,
>
> a quick update from my side after looking through the logs. The problem
> seems to be that the Dispatcher does not start recovering the jobs after
> regaining the leadership after it lost it before. I cannot yet tell why
> this is happening and I try to further debug the problem.
>
> If you manage to reproduce the problem, could you maybe run the cluster
> with DEBUG log levels and send me again the logs? If I don't manage to
> figure out how this problem happens until then, it would be super helpful.
>
> Cheers,
> Till
>
> On Mon, Mar 4, 2019 at 7:44 PM Steven Wu <[email protected]> wrote:
>
>> Till,
>>
>> I will send you the complete log offline. We don't know how to reliably
>> reproduce the problem. but it did happen quite frequently, like once every
>> a couple of days. Let me see if I can cherry pick the fix/commit to 1.7
>> branch.
>>
>> Thanks,
>> Steven
>>
>>
>> On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <[email protected]>
>> wrote:
>>
>>> Hi Steven,
>>>
>>> is this the tail of the logs or are there other statements following? I
>>> think your problem could indeed be related to FLINK-11537. Is it possible
>>> to somehow reliably reproduce this problem? If yes, then you could try out
>>> the RC for Flink 1.8.0 which should be published in the next days.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <[email protected]> wrote:
>>>
>>>> We have observe that sometimes job stuck in suspended state, and no job
>>>> restart/recover were attempted once job is suspended.
>>>> * it is a high-parallelism job (like close to 2,000)
>>>> * there were a few job restarts before this
>>>> * there were high GC pause during the period
>>>> * zookeeper timeout. probably caused by high GC pause
>>>>
>>>> Is it related to https://issues.apache.org/jira/browse/FLINK-11537?
>>>>
>>>> I pasted some logs in the end.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> 2019-02-28 19:04:36,357 WARN
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>>>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>>>> configuration section named 'Client' was found in speci
>>>> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'.
>>>> Will continue connection to Zookeeper server without SASL authentication,
>>>> if Zookeeper server allows it.
>>>> 2019-02-28 19:04:36,357 INFO
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>>>> Opening socket connection to server 100.82.141.106/100.82.141.106:2181
>>>> 2019-02-28 19:04:36,357 ERROR
>>>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>>>> Authentication failed
>>>> 2019-02-28 19:04:36,357 INFO
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>>>> connection established to 100.82.141.106/100.82.141.106:2181,
>>>> initiating session
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>>>> Session establishment complete on server
>>>> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2,
>>>> negotiated timeout = 40000
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>>> - State change: RECONNECTED
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>>>> monitored again.
>>>> 2019-02-28 19:04:36,360 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
>>>> ...
>>>> 2019-02-28 19:05:09,400 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>> 2019-02-28 19:05:09,400 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>>>> RESTARTING to SUSPENDING.
>>>> org.apache.flink.util.FlinkException: JobManager is no longer the
>>>> leader.
>>>>         at
>>>> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
>>>>         at
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>         at java.lang.Thread.run(Thread.java:748)
>>>> 2019-02-28 19:05:09,403 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>>>> SUSPENDING to SUSPENDED.
>>>> 2019-02-28 19:05:09,403 INFO
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
>>>> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
>>>> 2019-02-28 19:05:09,403 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Suspending
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>> Shutting down.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
>>>> ...
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>>> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is
>>>> no longer the leader..
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
>>>> SlotPool.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
>>>> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
>>>> 2019-02-28 19:05:09,449 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
>>>> SlotPool.
>>>> 2019-02-28 19:05:09,449 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Stopping ZooKeeperLeaderElectionService
>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
>>>> 2019-02-28 19:05:09,452 INFO
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from 
>>>> ZooKeeper.
>>>>
>>>

Re: [1.7.1] job stuck in suspended state

Reply via email to