Hi Borys,

I remember that another user reported a similar issue recently [1]; his
log file is attached to the ticket. If I recall correctly, we concluded
that YARN returned the containers very quickly. At the time, Flink's
debug-level logs were inconclusive because we did not log the reason why
a container was returned, and the user could not provide the YARN logs.
In 1.5.4, we improved the logging [2]. Hence, it would be good if you
could reproduce this with debug-level logging using Flink 1.5.4.
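(In case it helps: debug-level logging can usually be enabled by raising
the root logger level in conf/log4j.properties before submitting the job;
the snippet below assumes the log4j 1.x property format that Flink 1.5
ships with.)

```properties
# conf/log4j.properties - change the root level from INFO to DEBUG
log4j.rootLogger=DEBUG, file
```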

You could also try the vanilla Hadoop distribution, or Flink 1.6.
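For background on where the excess containers come from: when YARN hands
the application master more containers than there are outstanding
requests, the AM matches each container against its pending requests and
releases the rest back to YARN. A rough sketch of that logic, simplified
and self-contained (the class and method names below are illustrative,
not the actual YarnResourceManager code):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a container-allocation callback in an application
// master. Real code would go through AMRMClient; here we only model the
// accept-or-release decision that produces "Returning excess container".
public class ExcessContainerDemo {

    static class Allocator {
        int pendingRequests;                     // outstanding container requests
        final List<String> accepted = new ArrayList<>();
        final List<String> released = new ArrayList<>();

        Allocator(int pendingRequests) {
            this.pendingRequests = pendingRequests;
        }

        // Analogous to an onContainersAllocated callback: accept containers
        // while requests are pending, release everything beyond that.
        void onContainersAllocated(List<String> containers) {
            for (String container : containers) {
                if (pendingRequests > 0) {
                    pendingRequests--;           // matched to a pending request
                    accepted.add(container);
                } else {
                    released.add(container);     // "Returning excess container"
                }
            }
        }
    }

    public static void main(String[] args) {
        Allocator allocator = new Allocator(32); // job parallelism = 32
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            batch.add("container_" + i);         // YARN over-allocates by 8
        }
        allocator.onContainersAllocated(batch);
        System.out.println(allocator.accepted.size() + " accepted, "
                + allocator.released.size() + " released");
        // prints: 32 accepted, 8 released
    }
}
```

So a burst of "Received new container ... pending container requests: 0"
followed by "Returning excess container" means YARN delivered containers
after all requests were already satisfied.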

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10104
[2] https://issues.apache.org/jira/browse/FLINK-10137

On Tue, Oct 9, 2018 at 5:55 PM Borys Gogulski <borys.gogul...@mapp.com>
wrote:

> Hey guys,
>
> thanks for the replies.
> 1. "Requesting new TaskExecutor" looks fine: it appears exactly 32 times,
> matching the job's parallelism.
> The weird thing is that after those 32 containers are requested and
> received, we get this "flood" of 'Received new container/Returning excess
> container' messages (and, as shown below, it's actually doing something on
> the YARN side). Where do those come from?
> 2. I felt that DEBUG would be needed; we'll see what we can do about it.
> 3. Yes, all in favor of upgrading to 1.5.4. But as Gary mentioned, there
> seem to be no fixes that could heal it (I read the release notes prior to
> posting this thread ; )).
> 4. Hadoop: 2.6.0+cdh5.14.0
>
> Here are the logs for one of the "excess" containers:
> 1. Flink JM
> 2018-10-09 17:35:33,493 INFO  org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_3071_01_2485560 - Remaining pending container requests: 0
> 2018-10-09 17:35:33,493 INFO  org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_3071_01_2485560.
> 2. YARN
> 2018-10-09 17:35:33,283 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e96_1538374332137_3071_01_2485560 Container Transitioned from NEW
> to ALLOCATED
> 2018-10-09 17:35:33,283 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=user
> OPERATION=AM Allocated Container        TARGET=SchedulerApp
> RESULT=SUCCESS  APPID=application_1538374332137_3071
> CONTAINERID=container_e96_1538374332137_3071_01_2485560
> 2018-10-09 17:35:33,283 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
> Assigned container container_e96_1538374332137_3071_01_2485560 of capacity
> <memory:6144, vCores:1> on host server:44142, which has 5 containers,
> <memory:30720, vCores:5> used and <memory:2048, vCores:11> available after
> allocation
> 2018-10-09 17:35:33,283 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> assignedContainer application attempt=appattempt_1538374332137_3071_000001
> container=Container: [ContainerId:
> container_e96_1538374332137_3071_01_2485560, NodeId: server:44142,
> NodeHttpAddress: server:8042, Resource: <memory:6144, vCores:1>, Priority:
> 0, Token: null, ] queue=queue: capacity=0.5, absoluteCapacity=0.5,
> usedResources=<memory:2353152, vCores:383>, usedCapacity=1.9947916,
> absoluteUsedCapacity=0.9973958, numApps=2, numContainers=383
> clusterResource=<memory:2359296, vCores:1152>
> 2018-10-09 17:35:33,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e96_1538374332137_3071_01_2485560 Container Transitioned from
> ALLOCATED to ACQUIRED
> 2018-10-09 17:35:38,532 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e96_1538374332137_3071_01_2485560 Container Transitioned from
> ACQUIRED to RELEASED
> 2018-10-09 17:35:38,532 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
> Completed container: container_e96_1538374332137_3071_01_2485560 in state:
> RELEASED event:RELEASED
> 2018-10-09 17:35:38,532 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=user
> IP=ip      OPERATION=AM Released Container TARGET=SchedulerApp
> RESULT=SUCCESS  APPID=application_1538374332137_3071
> CONTAINERID=container_e96_1538374332137_3071_01_2485560
> 2018-10-09 17:35:38,532 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
> Released container container_e96_1538374332137_3071_01_2485560 of capacity
> <memory:6144, vCores:1> on host server:44142, which currently has 0
> containers, <memory:0, vCores:0> used and <memory:32768, vCores:16>
> available, release resources=true
> 2018-10-09 17:35:38,532 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> completedContainer container=Container: [ContainerId:
> container_e96_1538374332137_3071_01_2485560, NodeId: server:44142,
> NodeHttpAddress: server:8042, Resource: <memory:6144, vCores:1>, Priority:
> 0, Token: Token { kind: ContainerToken, service: ip:44142 }, ] queue=queue:
> capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:589824,
> vCores:96>, usedCapacity=0.5, absoluteUsedCapacity=0.25, numApps=2,
> numContainers=96 cluster=<memory:2359296, vCores:1152>
> 2018-10-09 17:35:38,532 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Application attempt appattempt_1538374332137_3071_000001 released container
> container_e96_1538374332137_3071_01_2485560 on node: host: server:44142
> #containers=0 available=32768 used=0 with event: RELEASED
>
> Best regards,
> Borys Gogulski
