Hi,

With some additional research:

*Before the flag* I realized that for failed containers (i.e. those that exited for a specific reason) we were still requesting a new TM container and launching a TM. But for the "Detected unreachable: [akka.tcp://fl...@blahabc.sfdc.net:123]" issue I do not see the container marked as failed, nor a subsequent request for a new TM.
*After `taskmanager.exit-on-fatal-akka-error: true`* I have not seen any "Detected unreachable: [akka.tcp://fl...@blahabc.sfdc.net:123]" messages since we turned the flag on. I am not sure if that is a coincidence or a direct impact of this change.

*Our Issue:* I realized we are still exiting the application, i.e. failing, when containers are lost. The reason is that before org.apache.flink.yarn.YarnFlinkResourceManager is able to acquire a new TM container, launch the TM, and have it reported as started, the org.apache.flink.runtime.jobmanager.scheduler throws a NoResourceAvailableException that causes a failure. In our case we had a fixed-delay restart strategy with 5 attempts, and we run out of them because of this. I am looking to solve this with a FailureRateRestartStrategy over a 2 minute interval (10 second restart delay, >12 failures allowed), which gives the TM time to come back (it takes about 50 seconds).

*Flink Bug* But I cannot help but wonder why there is no interaction between the ResourceManager and the JobManager, i.e. why does the JobManager continue with the processing despite not having the required TMs?
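For concreteness, here is a minimal sketch of the restart strategy I have in mind, set through the Java API on the StreamExecutionEnvironment (the numbers are illustrative and would still need tuning; the class name is just for the example):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class FailureRateRestartSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Tolerate up to 12 failures within a 2 minute window, waiting 10 seconds
        // between restart attempts, so a replacement TaskManager (~50 seconds to
        // come back in our cluster) can register before the job exhausts its restarts.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                12,                            // max failures per measurement interval (illustrative)
                Time.of(2, TimeUnit.MINUTES),  // measurement interval for the failure rate
                Time.of(10, TimeUnit.SECONDS)  // delay between restart attempts
        ));

        // ... define the actual job (sources, SQL queries, sinks) here ...

        env.execute("streaming-searches-prod");
    }
}

As far as I understand, the same could also be configured cluster-wide via the restart-strategy.failure-rate.* keys in flink-conf.yaml instead of in code.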
Logs to substantiate what I said above (only the relevant entries):

2018-09-03 06:17:13,932 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Container container_e28_1535135887442_1381_01_000097 completed successfully with diagnostics: Container released by application
2018-09-03 06:34:19,214 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:46219] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-09-03 06:34:19,214 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Container container_e27_1535135887442_1381_01_000102 failed. Exit status: -102
2018-09-03 06:34:19,215 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container container_e27_1535135887442_1381_01_000102 in state COMPLETE : exitStatus=-102 diagnostics=Container preempted by scheduler
2018-09-03 06:34:19,215 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Total number of failed containers so far: 1
2018-09-03 06:34:19,215 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Requesting new TaskManager container with 20000 megabytes memory. Pending requests: 1
2018-09-03 06:34:19,216 INFO org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:46219/user/taskmanager terminated.
2018-09-03 06:34:19,218 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-prod (96d3b4f60a80a898f44f87c5b06f6981) switched from state RUNNING to FAILING.
java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_1381_01_000102 @ hello-world9-27-crz.ops.sfdc.net (dataPort=40423)
2018-09-03 06:34:19,466 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager hello-world9-27-crz.ops.sfdc.net/11.11.35.220. Number of registered task managers 144. Number of available slots 720
2018-09-03 06:34:24,717 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Received new container: container_e28_1535135887442_1381_01_000147 - Remaining pending container requests: 0
2018-09-03 06:34:24,717 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Launching TaskManager in container ContainerInLaunch @ 1535956464717: Container: [ContainerId: container_e28_1535135887442_1381_01_000147, NodeId: hello-world9-27-crz.ops.sfdc.net:8041, NodeHttpAddress: hello-world9-27-crz.ops.sfdc.net:8042, Resource: <memory:20480, vCores:5>, Priority: 0, Token: Token { kind: ContainerToken, service: 11.11.35.220:8041 }, ] on host hello-world9-27-crz.ops.sfdc.net
2018-09-03 06:34:29,256 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:46219] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:46219]] Caused by: [Connection refused: hello-world9-27-crz.ops.sfdc.net/11.11.35.220:46219]
2018-09-03 06:34:34,284 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hello-world9-27-crz.ops.sfdc.net/11.11.35.220:46219
2018-09-03 06:34:34,284 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:46219] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:46219]] Caused by: [Connection refused: hello-world9-27-crz.ops.sfdc.net/11.11.35.220:46219]
2018-09-03 06:34:34,540 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-prod (96d3b4f60a80a898f44f87c5b06f6981) switched from state RESTARTING to CREATED.
2018-09-03 06:34:35,044 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-prod (96d3b4f60a80a898f44f87c5b06f6981) switched from state CREATED to RUNNING.
2018-09-03 06:34:35,195 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-prod (96d3b4f60a80a898f44f87c5b06f6981) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration.
Task to schedule: < Attempt #3 (Source: Custom Source -> (Map -> where: (AND(OR(=(UPPER(RuleMatch), _UTF-16LE'BRO'), =(UPPER(RuleMatch), _UTF-16LE'PAN'), =(UPPER(RuleMatch), _UTF-16LE'OLYMPUS'), =(UPPER(RuleMatch), _UTF-16LE'BIT9'), =(UPPER(RuleMatch), _UTF-16LE'OSQUERY'), =(UPPER(RuleMatch), _UTF-16LE'ILLUMIO')), OR(cidrMatch(IPSource, _UTF-16LE'10.0.0.0/8'), cidrMatch(IPSource, _UTF-16LE'192.168.0.0/16')), IS NULL(csvLookup(_UTF-16LE'a353A000000jGZb_whitelist.csv', _UTF
2018-09-03 06:34:39,248 INFO org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e28_1535135887442_1381_01_000147 has started.
2018-09-03 06:34:45,235 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Container container_e28_1535135887442_1381_01_000147 failed. Exit status: -102
2018-09-03 06:34:45,235 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container container_e28_1535135887442_1381_01_000147 in state COMPLETE : exitStatus=-102 diagnostics=Container preempted by scheduler
2018-09-03 06:34:45,236 INFO org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@hello-world9-27-crz.ops.sfdc.net:41966/user/taskmanager terminated.
2018-09-03 06:34:45,236 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Total number of failed containers so far: 2
2018-09-03 06:34:45,236 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager hello-world9-27-crz.ops.sfdc.net/11.11.35.220. Number of registered task managers 144. Number of available slots 720.
2018-09-03 06:34:45,236 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Received new container: container_e28_1535135887442_1381_01_000202 - Remaining pending container requests: 0
2018-09-03 06:34:45,236 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Launching TaskManager in container ContainerInLaunch @ 1535956485236: Container: [ContainerId: container_e28_1535135887442_1381_01_000202, NodeId: hello-world4-31-crz.ops.sfdc.net:8041, NodeHttpAddress: hello-world4-31-crz.ops.sfdc.net:8042, Resource: <memory:20480, vCores:5>, Priority: 0, Token: Token { kind: ContainerToken, service: 11.11.34.160:8041 }, ] on host hello-world4-31-crz.ops.sfdc.net
2018-09-03 06:34:45,241 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Received new container: container_e28_1535135887442_1381_01_000203 - Remaining pending container requests: 0
2018-09-03 06:34:45,241 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Returning excess container container_e28_1535135887442_1381_01_000203

Notice there is no "TaskManager container_e28_1535135887442_1381_01_000202 has started." message. Instead I see:

2018-09-03 06:34:56,894 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-prod (96d3b4f60a80a898f44f87c5b06f6981) switched from state FAILING to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #5 (Source: Custom Source -> (Map -> where: (AND(OR(=(UPPER(RuleMatch), _UTF-16LE'BRO'), =(UPPER(RuleMatch), _UTF-16LE'PAN'), =(UPPER(RuleMatch), _UTF-16LE'OLYMPUS'), =(UPPER(RuleMatch), _UTF-16LE'BIT9'), =(UPPER(R
2018-09-03 06:34:57,005 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status FAILED : The monitored job with ID 96d3b4f60a80a898f44f87c5b06f6981 has failed to complete.
2018-09-03 06:34:57,007 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager

On Fri, Aug 31, 2018 at 4:00 PM, Subramanya Suresh <ssur...@salesforce.com> wrote:

> Thanks Till,
> I do not see any Akka related messages in that TaskManager after the initial startup. It seemed like all was well. So after the RemoteWatcher detects it unreachable and the TaskManager is unregistered, I do not see any other activity in the JobManager with regards to reallocation etc.
> - Does the quarantining of the TaskManager not happen until exit-on-fatal-akka-error is turned on?
> - Does the JobManager or the TaskManager try to reconnect to each other again? Is there a different setting for it?
> - Does the JobManager not reallocate a TaskManager despite it being unregistered, until the TaskManager exits? I think it should, especially if it is not trying to establish a connection again.
>
> I will give the flag a try.
>
> Sincerely,
>
> On Fri, Aug 31, 2018 at 2:53 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Could you check whether akka.tcp://flink@blahabc.sfdc.net:123/user/taskmanager is still running? E.g. tries to reconnect to the JobManager? If this is the case, then the container is still running and the YarnFlinkResourceManager thinks that everything is alright. You can activate that a TaskManager kills itself if it gets quarantined by setting `taskmanager.exit-on-fatal-akka-error: true` in the `flink-conf.yaml`.
>>
>> Cheers,
>> Till
>>
>> On Fri, Aug 31, 2018 at 10:43 AM Subramanya Suresh <ssur...@salesforce.com> wrote:
>>
>>> Hi Till,
>>> Greatly appreciate your reply.
>>> We use version 1.4.2. I do not see anything unusual in the logs for the TM that was lost. Note: I have looked at many such failures and see the same pattern below.
>>>
>>> The JM logs above had most of what I had, but the below is what I have when I search for flink.yarn (we have huge logs otherwise, given the amount of SQL queries we run). The gist is Akka detects unreachable, TM marked lost and unregistered by JM, operators start failing with NoResourceAvailableException since there was one less TM, 5 retry attempts later job goes down.
>>>
>>> ………….
>>>
>>> 2018-08-29 23:02:41,216 INFO org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e27_1535135887442_0906_01_000124 has started.
>>> 2018-08-29 23:02:50,095 INFO org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_e27_1535135887442_0906_01_000159 has started.
>>> 2018-08-29 23:02:50,409 INFO org.apache.flink.yarn.YarnJobManager - Submitting job ab7e96a1659fbb4bfb3c6cd9bccb0335 (streaming-searches-prod).
>>> 2018-08-29 23:02:50,429 INFO org.apache.flink.yarn.YarnJobManager - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=5, delayBetweenRestartAttempts=10000) for ab7e96a1659fbb4bfb3c6cd9bccb0335.
>>> 2018-08-29 23:02:50,486 INFO org.apache.flink.yarn.YarnJobManager - Running initialization on master for job streaming-searches-prod (ab7e96a1659fbb4bfb3c6cd9bccb0335).
>>> 2018-08-29 23:02:50,487 INFO org.apache.flink.yarn.YarnJobManager - Successfully ran initialization on master in 0 ms.
>>> 2018-08-29 23:02:50,684 INFO org.apache.flink.yarn.YarnJobManager - Using application-defined state backend for checkpoint/savepoint metadata: File State Backend @ hdfs://security-temp/savedSearches/checkpoint.
>>> 2018-08-29 23:02:50,920 INFO org.apache.flink.yarn.YarnJobManager - Scheduling job ab7e96a1659fbb4bfb3c6cd9bccb0335 (streaming-searches-prod).
>>> 2018-08-29 23:12:05,240 INFO org.apache.flink.yarn.YarnJobManager - Attempting to recover all jobs.
>>> 2018-08-29 23:12:05,716 INFO org.apache.flink.yarn.YarnJobManager - There are 1 jobs to recover. Starting the job recovery.
>>> 2018-08-29 23:12:05,806 INFO org.apache.flink.yarn.YarnJobManager - Attempting to recover job ab7e96a1659fbb4bfb3c6cd9bccb0335.
>>> 2018-08-29 23:12:07,308 INFO org.apache.flink.yarn.YarnJobManager - Ignoring job recovery for ab7e96a1659fbb4bfb3c6cd9bccb0335, because it is already submitted.
>>> 2018-08-29 23:13:57,364 INFO org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@blahabc.sfdc.net:123/user/taskmanager terminated.
>>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
>>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
>>> 2018-08-29 23:13:58,645 INFO org.apache.flink.yarn.YarnJobManager - Received message for non-existing checkpoint 1
>>> 2018-08-29 23:15:26,126 INFO org.apache.flink.yarn.YarnJobManager - Job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 is in terminal state FAILED. Shutting down session
>>> 2018-08-29 23:15:26,128 INFO org.apache.flink.yarn.YarnJobManager - Stopping JobManager with final application status FAILED and diagnostics: The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
>>> 2018-08-29 23:15:26,129 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status FAILED : The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
>>> 2018-08-29 23:15:26,146 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager
>>> 2018-08-29 23:15:31,363 INFO org.apache.flink.yarn.YarnJobManager - Deleting yarn application files under hdfs://security-temp/user/sec-data-app/.flink/application_1535135887442_0906.
>>> 2018-08-29 23:15:31,370 INFO org.apache.flink.yarn.YarnJobManager - Stopping JobManager akka.tcp://fl...@blahxyz.sfdc.net:1235/user/jobmanager.
>>> 2018-08-29 23:15:41,225 ERROR org.apache.flink.yarn.YarnJobManager - Actor system shut down timed out.
>>> 2018-08-29 23:15:41,226 INFO org.apache.flink.yarn.YarnJobManager - Shutdown completed. Stopping JVM.
>>>
>>> On Thu, Aug 30, 2018 at 1:55 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>>> Hi Subramanya,
>>>>
>>>> in order to help you I need a little bit of context. Which version of Flink are you running? The configuration yarn.reallocate-failed is deprecated since version Flink 1.1 and does not have an effect anymore.
>>>>
>>>> What would be helpful is to get the full jobmanager log from you. If the YarnFlinkResourceManager gets notified that a container has failed, it should restart this container (it will do this 145 times). So if the YarnFlinkResourceManager does not get notified about a completed container, then this might indicate that the container is still running. So it would be good to check what the logs of container_e27_1535135887442_0906_01_000039 say.
>>>> Moreover, do you see the same problem occur when using the latest release Flink 1.6.0?
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Aug 30, 2018 at 8:48 AM Subramanya Suresh <ssur...@salesforce.com> wrote:
>>>>
>>>>> Hi, we are seeing a weird issue where one TaskManager is lost and then never re-allocated, and subsequently operators fail with NoResourceAvailableException and after 5 restarts (we have FixedDelay restarts of 5) the application goes down.
>>>>>
>>>>> - We have explicitly set *yarn.reallocate-failed:* true and have not specified yarn.maximum-failed-containers (and see "org.apache.flink.yarn.YarnApplicationMasterRunner - YARN application tolerates 145 failed TaskManager containers before giving up" in the logs).
>>>>> - After the initial startup where all 145 TaskManagers are requested, I never see any logs saying "Requesting new TaskManager container" to reallocate the failed container.
>>>>>
>>>>> 2018-08-29 23:13:56,655 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@blahabc.sfdc.net:123]
>>>>> 2018-08-29 23:13:57,364 INFO org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@blahabc.sfdc.net:123/user/taskmanager terminated.
>>>>> java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_0906_01_000039 @ blahabc.sfdc.net (dataPort=124)
>>>>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:107)
>>>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>>>>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
>>>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>>>>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_0906_01_000039 @ blahabc.sfdc.net:42414 (dataPort=124)
>>>>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:107)
>>>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>>>>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
>>>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>>>>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>
>>>>> 2018-08-29 23:13:58,529 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager blahabc.sfdc.net:/1.1.1.1. Number of registered task managers 144. Number of available slots 720.
>>>>> 2018-08-29 23:13:58,645 INFO org.apache.flink.yarn.YarnJobManager - Received message for non-existing checkpoint 1
>>>>> 2018-08-29 23:14:39,969 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
>>>>> 2018-08-29 23:14:39,975 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
>>>>> 2018-08-29 23:14:39,975 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #1
>>>>>
>>>>> *After 5 retries of our Sql query execution graph (we have configured 5 fixed delay restart), it outputs the below,*
>>>>>
>>>>> 2018-08-29 23:15:22,216 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job ab7e96a1659fbb4bfb3c6cd9bccb0335
>>>>> 2018-08-29 23:15:22,216 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>> 2018-08-29 23:15:22,225 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /prod/link/application_1535135887442_0906/checkpoints/ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper
>>>>> 2018-08-29 23:15:23,460 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>> 2018-08-29 23:15:23,460 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper
>>>>> 2018-08-29 23:15:24,114 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Removed job graph ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper.
>>>>> 2018-08-29 23:15:26,126 INFO org.apache.flink.yarn.YarnJobManager - Job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 is in terminal state FAILED. Shutting down session
>>>>> 2018-08-29 23:15:26,128 INFO org.apache.flink.yarn.YarnJobManager - Stopping JobManager with final application status FAILED and diagnostics: The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
>>>>> 2018-08-29 23:15:26,129 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status FAILED : The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
>>>>> 2018-08-29 23:15:26,146 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager
>>>>> 2018-08-29 23:15:27,997 INFO org.apache.flink.runtime.history.FsJobArchivist - Job ab7e96a1659fbb4bfb3c6cd9bccb0335 has been archived at hdfs:/savedSearches/prod/completed-jobs/ab7e96a1659fbb4bfb3c6cd9bccb0335.
>>>>>
>>>>> Cheers,