Bhumika Bayani created FLINK-8624:
-------------------------------------

             Summary: flink-mesos: The flink rest-api sometimes becomes unresponsive
                 Key: FLINK-8624
                 URL: https://issues.apache.org/jira/browse/FLINK-8624
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.3.2
            Reporter: Bhumika Bayani

Sometimes the flink-mesos-scheduler fails or gets killed, and Marathon brings it up again on another node. In some of these cases we have observed that the REST API of the newly created Flink instance becomes unresponsive.

Even when we issue API calls manually with curl, such as
http://<host>:<port>/overview
or
http://<host>:<port>/config
we do not receive any response.

We submit and execute all our Flink jobs through the REST API only. So when the REST API becomes unresponsive, we cannot run any Flink jobs and no stream processing happens.

We tried enabling Flink debug logs, but we did not observe anything specific that indicates why the REST API is failing or unresponsive.

We see the exceptions below in the logs, but they are not specific to the case where the REST API is hung. We see them in a healthy flink-scheduler too:

{code:java}
Timestamp=2018-02-08 05:43:49,175 LogLevel=INFO ThreadId=[Checkpoint Timer] Class=o.a.f.r.c.CheckpointCoordinator Msg=Triggering checkpoint 10181 @ 1518068629174
Timestamp=2018-02-08 05:43:49,183 LogLevel=DEBUG ThreadId=[nioEventLoopGroup-5-3] Class=o.a.f.r.w.WebRuntimeMonitor Msg=Unhandled exception: {}
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager#753807801]] after [10000 ms]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
{code}

While the REST API is unresponsive, the Flink web UI also does not load or show any information.

Restarting the flink-scheduler sometimes solves this issue.
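
For anyone trying to reproduce this, the manual curl checks on /overview and /config could be replaced by a small probe with an explicit timeout, so that a hung endpoint is reported instead of blocking the caller. The following is only a sketch under assumptions: the function names probe and check_rest, the 5 second default timeout, and the status labels are made up for illustration and are not part of Flink. Only the two endpoint paths come from the report above.

```python
import socket
import urllib.error
import urllib.request


def probe(base_url, path, timeout_s=5.0):
    """GET base_url + path and return a (status, detail) tuple.

    status is one of "ok", "http_error", "timeout", "unreachable".
    These labels are illustrative, not anything defined by Flink.
    """
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # force the body to be read within the timeout
            return ("ok", str(resp.status))
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return ("http_error", str(exc.code))
    except urllib.error.URLError as exc:
        # Failures while connecting or sending are wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return ("timeout", str(exc.reason))
        return ("unreachable", str(exc.reason))
    except (socket.timeout, TimeoutError) as exc:
        # Connection was accepted but no reply arrived in time.
        return ("timeout", str(exc))
    except OSError as exc:
        return ("unreachable", str(exc))


def check_rest(base_url, paths=("/overview", "/config"), timeout_s=5.0):
    """Probe each path and return {path: (status, detail)}."""
    return {path: probe(base_url, path, timeout_s) for path in paths}
```

Calling check_rest("http://<host>:<port>") with the real host and port returns one entry per endpoint. A "timeout" result means no connection or no reply arrived within the limit, which covers the hang described above where the call gets no response, while "unreachable" means the connection itself failed, for example because nothing is listening on that port.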