Hi Till,

Greatly appreciate your reply. We use version 1.4.2. I do not see anything unusual in the logs for the TM that was lost. Note: I have looked at many such failures and see the same pattern below.
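For reference, the restart behaviour described below corresponds roughly to the following flink-conf.yaml entries (a sketch only; we may just as well have set the strategy programmatically, and the attempt count and delay are taken from the FixedDelayRestartStrategy(5, 10000) line in the logs):

  # fixed-delay restart: give up after 5 attempts, waiting 10 s between attempts
  restart-strategy: fixed-delay
  restart-strategy.fixed-delay.attempts: 5
  restart-strategy.fixed-delay.delay: 10 s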
The JM logs above had most of what I had, but below is what I get when I search for flink.yarn (our logs are otherwise huge, given the number of SQL queries we run). The gist is: Akka detects the TM as unreachable, the TM is marked lost and unregistered by the JM, operators start failing with NoResourceAvailableException since there is one less TM, and 5 retry attempts later the job goes down.

………….

2018-08-29 23:02:41,216 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e27_1535135887442_0906_01_000124 has started.
2018-08-29 23:02:50,095 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e27_1535135887442_0906_01_000159 has started.
2018-08-29 23:02:50,409 INFO  org.apache.flink.yarn.YarnJobManager - Submitting job ab7e96a1659fbb4bfb3c6cd9bccb0335 (streaming-searches-prod).
2018-08-29 23:02:50,429 INFO  org.apache.flink.yarn.YarnJobManager - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=5, delayBetweenRestartAttempts=10000) for ab7e96a1659fbb4bfb3c6cd9bccb0335.
2018-08-29 23:02:50,486 INFO  org.apache.flink.yarn.YarnJobManager - Running initialization on master for job streaming-searches-prod (ab7e96a1659fbb4bfb3c6cd9bccb0335).
2018-08-29 23:02:50,487 INFO  org.apache.flink.yarn.YarnJobManager - Successfully ran initialization on master in 0 ms.
2018-08-29 23:02:50,684 INFO  org.apache.flink.yarn.YarnJobManager - Using application-defined state backend for checkpoint/savepoint metadata: File State Backend @ hdfs://security-temp/savedSearches/checkpoint.
2018-08-29 23:02:50,920 INFO  org.apache.flink.yarn.YarnJobManager - Scheduling job ab7e96a1659fbb4bfb3c6cd9bccb0335 (streaming-searches-prod).
2018-08-29 23:12:05,240 INFO  org.apache.flink.yarn.YarnJobManager - Attempting to recover all jobs.
2018-08-29 23:12:05,716 INFO  org.apache.flink.yarn.YarnJobManager - There are 1 jobs to recover. Starting the job recovery.
2018-08-29 23:12:05,806 INFO  org.apache.flink.yarn.YarnJobManager - Attempting to recover job ab7e96a1659fbb4bfb3c6cd9bccb0335.
2018-08-29 23:12:07,308 INFO  org.apache.flink.yarn.YarnJobManager - Ignoring job recovery for ab7e96a1659fbb4bfb3c6cd9bccb0335, because it is already submitted.
2018-08-29 23:13:57,364 INFO  org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@blahabc.sfdc.net:123/user/taskmanager terminated.
at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
2018-08-29 23:13:58,645 INFO  org.apache.flink.yarn.YarnJobManager - Received message for non-existing checkpoint 1
2018-08-29 23:15:26,126 INFO  org.apache.flink.yarn.YarnJobManager - Job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 is in terminal state FAILED. Shutting down session
2018-08-29 23:15:26,128 INFO  org.apache.flink.yarn.YarnJobManager - Stopping JobManager with final application status FAILED and diagnostics: The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
2018-08-29 23:15:26,129 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status FAILED : The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
2018-08-29 23:15:26,146 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager
2018-08-29 23:15:31,363 INFO  org.apache.flink.yarn.YarnJobManager - Deleting yarn application files under hdfs://security-temp/user/sec-data-app/.flink/application_1535135887442_0906.
2018-08-29 23:15:31,370 INFO  org.apache.flink.yarn.YarnJobManager - Stopping JobManager akka.tcp://fl...@blahxyz.sfdc.net:1235/user/jobmanager.
2018-08-29 23:15:41,225 ERROR org.apache.flink.yarn.YarnJobManager - Actor system shut down timed out.
2018-08-29 23:15:41,226 INFO  org.apache.flink.yarn.YarnJobManager - Shutdown completed. Stopping JVM.

On Thu, Aug 30, 2018 at 1:55 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Subramanya,
>
> in order to help you I need a little bit of context. Which version of Flink are you running? The configuration yarn.reallocate-failed is deprecated since version Flink 1.1 and does not have an effect anymore.
>
> What would be helpful is to get the full jobmanager log from you. If the YarnFlinkResourceManager gets notified that a container has failed, it should restart this container (it will do this 145 times). So if the YarnFlinkResourceManager does not get notified about a completed container, then this might indicate that the container is still running. So it would be good to check what the logs of container_e27_1535135887442_0906_01_000039 say.
>
> Moreover, do you see the same problem occur when using the latest release Flink 1.6.0?
>
> Cheers,
> Till
>
> On Thu, Aug 30, 2018 at 8:48 AM Subramanya Suresh <ssur...@salesforce.com> wrote:
>
>> Hi, we are seeing a weird issue where one TaskManager is lost and then never re-allocated, and subsequently operators fail with NoResourceAvailableException and after 5 restarts (we have FixedDelay restarts of 5) the application goes down.
>>
>> - We have explicitly set yarn.reallocate-failed: true and have not specified yarn.maximum-failed-containers (and see "org.apache.flink.yarn.YarnApplicationMasterRunner - YARN application tolerates 145 failed TaskManager containers before giving up" in the logs).
>> - After the initial startup, where all 145 TaskManagers are requested, I never see any logs saying "Requesting new TaskManager container" to reallocate the failed container.
>>
>> 2018-08-29 23:13:56,655 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@blahabc.sfdc.net:123]
>>
>> 2018-08-29 23:13:57,364 INFO  org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@blahabc.sfdc.net:123/user/taskmanager terminated.
>>
>> java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_0906_01_000039 @ blahabc.sfdc.net (dataPort=124)
>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:107)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_0906_01_000039 @ blahabc.sfdc.net:42414 (dataPort=124)
>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:107)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> 2018-08-29 23:13:58,529 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager blahabc.sfdc.net:/1.1.1.1. Number of registered task managers 144. Number of available slots 720.
>> 2018-08-29 23:13:58,645 INFO  org.apache.flink.yarn.YarnJobManager - Received message for non-existing checkpoint 1
>> 2018-08-29 23:14:39,969 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
>> 2018-08-29 23:14:39,975 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
>> 2018-08-29 23:14:39,975 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
>>
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration.
>> Task to schedule: < Attempt #1
>>
>> After 5 retries of our SQL query execution graph (we have configured 5 fixed-delay restarts), it outputs the below:
>>
>> 2018-08-29 23:15:22,216 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job ab7e96a1659fbb4bfb3c6cd9bccb0335
>> 2018-08-29 23:15:22,216 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>> 2018-08-29 23:15:22,225 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /prod/link/application_1535135887442_0906/checkpoints/ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper
>> 2018-08-29 23:15:23,460 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>> 2018-08-29 23:15:23,460 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper
>> 2018-08-29 23:15:24,114 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Removed job graph ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper.
>> 2018-08-29 23:15:26,126 INFO  org.apache.flink.yarn.YarnJobManager - Job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 is in terminal state FAILED. Shutting down session
>> 2018-08-29 23:15:26,128 INFO  org.apache.flink.yarn.YarnJobManager - Stopping JobManager with final application status FAILED and diagnostics: The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
>> 2018-08-29 23:15:26,129 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status FAILED : The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
>> 2018-08-29 23:15:26,146 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager
>> 2018-08-29 23:15:27,997 INFO  org.apache.flink.runtime.history.FsJobArchivist - Job ab7e96a1659fbb4bfb3c6cd9bccb0335 has been archived at hdfs:/savedSearches/prod/completed-jobs/ab7e96a1659fbb4bfb3c6cd9bccb0335.
>>
>> Cheers,
>>
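PS: For anyone skimming the thread, the YARN settings discussed above boil down to flink-conf.yaml entries along these lines (a sketch; we do not set yarn.maximum-failed-containers explicitly, 145 is simply the tolerance our logs report, and yarn.reallocate-failed is the key Till points out is deprecated since Flink 1.1 and has no effect anymore):

  # yarn.reallocate-failed: true        # deprecated since Flink 1.1, no effect anymore
  yarn.maximum-failed-containers: 145   # failed TM containers tolerated before the application gives up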