Hey guys, thanks for the replies.

1. "Requesting new TaskExecutor" looks fine: there are exactly 32 of them, matching the job's parallelism. The weird thing is that after those 32 containers are requested and received, we get this "flood" of "Received new container" / "Returning excess container" messages (and, as shown below, it is actually doing something on the YARN side). Where do those come from?
2. I suspected DEBUG logs would be needed; we'll see what we can do about it.
3. Yes, all in favor of upgrading to 1.5.4. But as Gary mentioned, there seem to be no fixes that would cure this (I read the release notes before posting this thread ;)).
4. Hadoop: 2.6.0+cdh5.14.0
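To put a number on the "flood" from point 1, the JobManager log could be tallied with a small script. This is only a rough sketch: it assumes the message wording matches the two YarnResourceManager lines quoted in this thread, and `tally_containers` and `SAMPLE` are illustrative names, not part of Flink.

```python
import re
from collections import Counter

# Assumed message formats, taken from the YarnResourceManager lines in this thread.
RECEIVED = re.compile(r"Received new container: (container_\w+)")
EXCESS = re.compile(r"Returning excess container (container_\w+)")


def tally_containers(lines):
    """Count received containers and those returned as excess in JM log lines."""
    counts = Counter()
    excess_ids = set()
    for line in lines:
        if RECEIVED.search(line):
            counts["received"] += 1
        match = EXCESS.search(line)
        if match:
            counts["excess"] += 1
            excess_ids.add(match.group(1))
    return counts, excess_ids


# Two JM log lines copied from this thread, used as a tiny sample input.
SAMPLE = [
    "2018-10-09 17:35:33,493 INFO org.apache.flink.yarn.YarnResourceManager - "
    "Received new container: container_e96_1538374332137_3071_01_2485560 - "
    "Remaining pending container requests: 0",
    "2018-10-09 17:35:33,493 INFO org.apache.flink.yarn.YarnResourceManager - "
    "Returning excess container container_e96_1538374332137_3071_01_2485560.",
]

if __name__ == "__main__":
    counts, excess_ids = tally_containers(SAMPLE)
    print(dict(counts), sorted(excess_ids))
```

Run against the full JM log (for example by passing an open file object instead of `SAMPLE`), the "received" count minus the 32 expected containers would give the size of the excess, and the collected IDs can be looked up in the YARN ResourceManager log.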
Here are logs for one of "excess" containers:

1. Flink JM

2018-10-09 17:35:33,493 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_3071_01_2485560 - Remaining pending container requests: 0
2018-10-09 17:35:33,493 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_3071_01_2485560.

2. YARN

2018-10-09 17:35:33,283 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e96_1538374332137_3071_01_2485560 Container Transitioned from NEW to ALLOCATED
2018-10-09 17:35:33,283 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=user OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1538374332137_3071 CONTAINERID=container_e96_1538374332137_3071_01_2485560
2018-10-09 17:35:33,283 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_e96_1538374332137_3071_01_2485560 of capacity <memory:6144, vCores:1> on host server:44142, which has 5 containers, <memory:30720, vCores:5> used and <memory:2048, vCores:11> available after allocation
2018-10-09 17:35:33,283 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1538374332137_3071_000001 container=Container: [ContainerId: container_e96_1538374332137_3071_01_2485560, NodeId: server:44142, NodeHttpAddress: server:8042, Resource: <memory:6144, vCores:1>, Priority: 0, Token: null, ] queue=queue: capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:2353152, vCores:383>, usedCapacity=1.9947916, absoluteUsedCapacity=0.9973958, numApps=2, numContainers=383 clusterResource=<memory:2359296, vCores:1152>
2018-10-09 17:35:33,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e96_1538374332137_3071_01_2485560 Container Transitioned from ALLOCATED to ACQUIRED
2018-10-09 17:35:38,532 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e96_1538374332137_3071_01_2485560 Container Transitioned from ACQUIRED to RELEASED
2018-10-09 17:35:38,532 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_e96_1538374332137_3071_01_2485560 in state: RELEASED event:RELEASED
2018-10-09 17:35:38,532 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=user IP=ip OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1538374332137_3071 CONTAINERID=container_e96_1538374332137_3071_01_2485560
2018-10-09 17:35:38,532 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_e96_1538374332137_3071_01_2485560 of capacity <memory:6144, vCores:1> on host server:44142, which currently has 0 containers, <memory:0, vCores:0> used and <memory:32768, vCores:16> available, release resources=true
2018-10-09 17:35:38,532 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_e96_1538374332137_3071_01_2485560, NodeId: server:44142, NodeHttpAddress: server:8042, Resource: <memory:6144, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: ip:44142 }, ] queue=queue: capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:589824, vCores:96>, usedCapacity=0.5, absoluteUsedCapacity=0.25, numApps=2, numContainers=96 cluster=<memory:2359296, vCores:1152>
2018-10-09 17:35:38,532 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1538374332137_3071_000001 released container container_e96_1538374332137_3071_01_2485560 on node: host: server:44142 #containers=0 available=32768 used=0 with event: RELEASED

Best regards,
Borys Gogulski

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/