Hi Borys,

To debug how many containers Flink is requesting, you can look out for the
log statement below [1]:

    Requesting new TaskExecutor container with resources [...]
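
Counting those occurrences in the JM log gives the number of containers that
were requested. A quick way to do that, assuming the log has been fetched to
a local file (the file name below is just a placeholder):

    grep -c "Requesting new TaskExecutor container" jobmanager.log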

If you need help debugging, could you attach the full JM logs (preferably at
DEBUG level)? Would it also be possible for you to test against 1.5.3 or
1.5.4? I am not aware of any related issues that were fixed in those
releases, though. Which Hadoop distribution are you using?
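
In case it helps, here is a minimal sketch of how the JM log level can be
raised to DEBUG, assuming the default log4j.properties that ships in Flink's
conf/ directory (the YARN session typically has to be restarted for the
change to take effect):

    # conf/log4j.properties (assumed default setup)
    log4j.rootLogger=DEBUG, file
    # or, more targeted, only the YARN-related classes:
    log4j.logger.org.apache.flink.yarn=DEBUG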

Best,
Gary

[1]
https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L454

On Wed, Oct 3, 2018 at 11:36 AM Borys Gogulski <borys.gogul...@mapp.com>
wrote:

> Hey,
>
>
>
> We’re running Flink 1.5.2 (I know there’s 1.5.4 and 1.6.1) on YARN for
> some jobs we’re processing. It’s a “long running” container to which we’re
> submitting jobs – all jobs submitted to that container have a parallelism
> of 32 (to be precise: in this job there are 8 subtasks with parallelism 32
> and one subtask with parallelism 1), and we run at most 8 of them at a
> time. TMs are set to have one slot only and 6 GB RAM each.
> In the beginning, when using Flink 1.5.0 and 1.5.1 with the “on-demand”
> resource policy, we noticed that more containers than required were
> spawned, but with Flink 1.5.2 it “stabilized” – some containers were
> obviously kept around for a while after a job finished (with no additional
> job submitted to take those resources), but the overhead wasn’t big, so we
> were “all good”.
> And here’s the plot twist.
> For a couple of days now we’ve been witnessing situations in which
> spawning one job makes Flink request a couple hundred TMs. Additionally,
> in the JM’s logs we can find dozens of lines like:
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594295 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594295.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594300 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594300.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594303 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594303.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594304 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594304.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594334 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594334.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594337 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594337.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594152 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594152.
> 2018-10-03 11:08:27,186 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_e96_1538374332137_0793_01_594410 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,187 INFO  org.apache.flink.yarn.YarnResourceManager  - Returning excess container container_e96_1538374332137_0793_01_594410.
>
> The only change made last week seems to be adding 5 new nodes to the YARN
> cluster. Any ideas why it’s requesting so many containers? Any ideas why
> there’s this “Received/Returning” flood? Just now one job was started and,
> all of a sudden, 352 containers were requested from YARN (also almost
> maxing out our YARN queue’s RAM).
>
>
>
> We’re also experiencing JM hangs (we can’t view the UI and TMs can’t
> communicate with the JM), but first I’d like to resolve the above “issue”,
> as it might be the cause of the rest of our problems.
>
>
>
> Best regards,
> Borys Gogulski
>
