We have a single JobManager in an HA setup. From the logs and metrics, it appears that before the issue occurred there was a long (~15 s) GC pause on the JobManager, which triggered a leadership revocation and re-election. Because there is only one JobManager, the same one became leader again after it recovered.
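A pause like that can outlive the ZooKeeper client session timeout, which is configurable. As a minimal sketch (keys are from Flink's HA configuration; the values and paths below are illustrative, not taken from this cluster), raising the session timeout above the worst-case pause and logging GC stop times makes this failure mode both rarer and easier to diagnose:

```yaml
# flink-conf.yaml -- illustrative values, tune to your environment
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181

# Session timeout negotiated with ZooKeeper (ms). A stop-the-world GC
# pause longer than this expires the session and revokes leadership.
high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000

# Log JVM stop-the-world pauses so they can be correlated with
# ZooKeeper SUSPENDED/RECONNECTED transitions.
env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/flink/gc.log"
```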
In the logs, we see:

2018-11-15 19:28:10.039 INFO Client session timed out, have not heard from server in 27384ms for sessionid 0x3655a5782843c03, closing socket connection and attempting reconnect
2018-11-15 19:28:10.138 INFO State change: SUSPENDED
2018-11-15 19:28:10.141 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.144 INFO JobManager for job {JOB}. Interval: 300000 ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min (853c2f252ac2e44de63025ec36efefdc) was revoked leadership at akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/dispatcher no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0 no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender http://localhost:18080 no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/resourcemanager no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
2018-11-15 19:28:10.148 INFO http://localhost:18080 lost leadership
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.440 WARN SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-265845598028730621.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-11-15 19:28:10.440 INFO Opening socket connection to server 10.0.39.156/10.0.39.156:2181
2018-11-15 19:28:10.440 ERROR Authentication failed
2018-11-15 19:28:10.441 INFO Socket connection established to 10.0.39.156/10.0.39.156:2181, initiating session
2018-11-15 19:28:10.459 INFO Session establishment complete on server 10.0.39.156/10.0.39.156:2181, sessionid = 0x3655a5782843c03, negotiated timeout = 40000
2018-11-15 19:28:10.459 INFO State change: RECONNECTED
2018-11-15 19:28:10.460 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.470 INFO JobManager runner for {job}. Interval: 300000 ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min (853c2f252ac2e44de63025ec36efefdc) was granted leadership with session id 3409e3b7-8824-4352-a44d-9dac4b101221 at akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.480 INFO http://localhost:18080 was granted leadership with leaderSessionID=764d2013-5812-4721-be24-1c47a58c0314

On Tue, Dec 11, 2018 at 1:35 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Micah,
>
> the problem looks indeed similar to FLINK-10255. Could you tell me your
> exact cluster setup (HA with standby JobManagers?). Moreover, the logs of
> all JobManagers on DEBUG level would be helpful for further debugging.
>
> Cheers,
> Till
>
> On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter <s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> Thanks for reporting the problem, I think the exception trace looks
>> indeed very similar to traces in the discussion for FLINK-10184. I will
>> pull in Till who worked on the fix to hear his opinion. Maybe the current
>> fix only made the problem less likely to appear but is not complete, yet?
>>
>> Best,
>> Stefan
>>
>> > On 11. Dec 2018, at 05:19, Micah Wylde <mwy...@lyft.com> wrote:
>> >
>> > Hello,
>> >
>> > We've been seeing an issue with several Flink 1.5.4 clusters that looks like this:
>> >
>> > 1. Job is cancelled with a savepoint
>> > 2. The jar is deleted from our HA blobstore (S3)
>> > 3. The jobgraph in ZK is *not* deleted
>> > 4. We restart the cluster
>> > 5. Startup fails in recovery because the jar is not available, with the stacktrace:
>> >
>> > 00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>> > java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>> >     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
>> >     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>> >     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>> >     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>> >     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> >     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> >     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> >     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> > Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0
>> >
>> > This superficially seems similar to several issues that have apparently been fixed in 1.5.4, like FLINK-10255 and FLINK-10184.
>> >
>> > Has anybody else seen this issue on 1.5.4 (or later) clusters? Or any advice for debugging?
>> >
>> > Thanks,
>> > Micah
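The stale state described in steps 2-3 can be inspected directly in ZooKeeper. A hypothetical cleanup session (the znode paths depend on `high-availability.zookeeper.path.root` and the cluster-id, so the `/flink/default` prefix below is an assumed default; the job id is the one from the stacktrace), safe only while the Flink cluster is fully stopped:

```
# Connect with the ZooKeeper CLI (requires a live ensemble; run this
# only while the Flink cluster is shut down).
zkCli.sh -server 10.0.39.156:2181

# List the job graphs Flink will try to recover under the HA root.
ls /flink/default/jobgraphs

# Remove the stale entry whose blob no longer exists in S3
# ('rmr' on ZooKeeper 3.4, 'deleteall' on 3.5+).
rmr /flink/default/jobgraphs/5a3fe2c00c05efd3a552a1c6707d2c10
```

This only works around a single occurrence; if the jobgraph znode is reliably left behind after cancel-with-savepoint, that still points at a cleanup bug along the lines of FLINK-10255/FLINK-10184.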