We have a single JobManager in an HA setup. From the logs and metrics, it appears that before the issue occurred there was a long (~15 s) GC pause on the JobManager, which triggered a leadership revocation and re-election. Because there is only one JobManager, the same one became leader again after it recovered.
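A pause like that can outlive the ZooKeeper client session timeout, which is configurable. As a minimal sketch (keys are from Flink's HA configuration; the values and paths below are illustrative, not taken from this cluster), raising the session timeout above the worst-case pause and logging GC stop times makes this failure mode both rarer and easier to diagnose:

```yaml
# flink-conf.yaml -- illustrative values, tune to your environment
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181

# Session timeout negotiated with ZooKeeper (ms). A stop-the-world GC
# pause longer than this expires the session and revokes leadership.
high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000

# Log JVM stop-the-world pauses so they can be correlated with
# ZooKeeper SUSPENDED/RECONNECTED transitions.
env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/flink/gc.log"
```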
In the logs, we see:

2018-11-15 19:28:10.039 INFO Client session timed out, have not heard from server in 27384ms for sessionid 0x3655a5782843c03, closing socket connection and attempting reconnect
2018-11-15 19:28:10.138 INFO State change: SUSPENDED
2018-11-15 19:28:10.141 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.144 INFO JobManager for job {JOB}. Interval: 300000 ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min (853c2f252ac2e44de63025ec36efefdc) was revoked leadership at akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/dispatcher no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0 no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender http://localhost:18080 no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/resourcemanager no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
2018-11-15 19:28:10.148 INFO http://localhost:18080 lost leadership
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.440 WARN SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-265845598028730621.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-11-15 19:28:10.440 INFO Opening socket connection to server 10.0.39.156/10.0.39.156:2181
2018-11-15 19:28:10.440 ERROR Authentication failed
2018-11-15 19:28:10.441 INFO Socket connection established to 10.0.39.156/10.0.39.156:2181, initiating session
2018-11-15 19:28:10.459 INFO Session establishment complete on server 10.0.39.156/10.0.39.156:2181, sessionid = 0x3655a5782843c03, negotiated timeout = 40000
2018-11-15 19:28:10.459 INFO State change: RECONNECTED
2018-11-15 19:28:10.460 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.470 INFO JobManager runner for {job}. Interval: 300000 ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min (853c2f252ac2e44de63025ec36efefdc) was granted leadership with session id 3409e3b7-8824-4352-a44d-9dac4b101221 at akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.480 INFO http://localhost:18080 was granted leadership with leaderSessionID=764d2013-5812-4721-be24-1c47a58c0314

On Tue, Dec 11, 2018 at 1:35 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Micah,
>
> the problem looks indeed similar to FLINK-10255. Could you tell me your
> exact cluster setup (HA with standby JobManagers?). Moreover, the logs of
> all JobManagers on DEBUG level would be helpful for further debugging.
>
> Cheers,
> Till
>
> On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter <s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> Thanks for reporting the problem, I think the exception trace looks
>> indeed very similar to traces in the discussion for FLINK-10184. I will
>> pull in Till who worked on the fix to hear his opinion. Maybe the current
>> fix only made the problem less likely to appear but is not complete, yet?
>>
>> Best,
>> Stefan
>>
>> > On 11. Dec 2018, at 05:19, Micah Wylde <mwy...@lyft.com> wrote:
>> >
>> > Hello,
>> >
>> > We've been seeing an issue with several Flink 1.5.4 clusters that looks like this:
>> >
>> > 1. Job is cancelled with a savepoint
>> > 2. The jar is deleted from our HA blobstore (S3)
>> > 3. The jobgraph in ZK is *not* deleted
>> > 4. We restart the cluster
>> > 5. Startup fails in recovery because the jar is not available, with the stacktrace:
>> >
>> > 00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>> > java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>> >     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
>> >     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>> >     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>> >     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>> >     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> >     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> >     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> >     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> > Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0
>> >
>> > This superficially seems similar to several issues that have apparently been fixed in 1.5.4, like FLINK-10255 and FLINK-10184.
>> >
>> > Has anybody else seen this issue on 1.5.4 (or later) clusters? Or any advice for debugging?
>> >
>> > Thanks,
>> > Micah
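The stale state described in steps 2-3 can be inspected directly in ZooKeeper. A hypothetical cleanup session (the znode paths depend on `high-availability.zookeeper.path.root` and the cluster-id, so the `/flink/default` prefix below is an assumed default; the job id is the one from the stacktrace), safe only while the Flink cluster is fully stopped:

```
# Connect with the ZooKeeper CLI (requires a live ensemble; run this
# only while the Flink cluster is shut down).
zkCli.sh -server 10.0.39.156:2181

# List the job graphs Flink will try to recover under the HA root.
ls /flink/default/jobgraphs

# Remove the stale entry whose blob no longer exists in S3
# ('rmr' on ZooKeeper 3.4, 'deleteall' on 3.5+).
rmr /flink/default/jobgraphs/5a3fe2c00c05efd3a552a1c6707d2c10
```

This only works around a single occurrence; if the jobgraph znode is reliably left behind after cancel-with-savepoint, that still points at a cleanup bug along the lines of FLINK-10255/FLINK-10184.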