Hi Borys,

To debug how many containers Flink is requesting, you can look for the log statement below [1]:

    Requesting new TaskExecutor container with resources [...]

If you need help debugging, can you attach the full JM logs (preferably on DEBUG level)? Would it be possible for you to test against 1.5.3 and 1.5.4? That said, I am not aware of any related issues that were fixed in 1.5.3 or 1.5.4. Which Hadoop distribution are you using?

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L454

On Wed, Oct 3, 2018 at 11:36 AM Borys Gogulski <borys.gogul...@mapp.com> wrote:

> Hey,
>
> We’re running Flink 1.5.2 (I know there’s 1.5.4 and 1.6.1) on YARN for
> some jobs we’re processing. It’s a “long running” container to which
> we’re submitting jobs. All jobs submitted to that container have a
> parallelism of 32 (to be precise: in this job there are 8 subtasks with
> parallelism 32 and one subtask with parallelism 1), and we’re running
> at most 8 of them. TMs are set to have only one slot and 6GB of RAM each.
>
> In the beginning, when using Flink 1.5.0 and 1.5.1 with the “on-demand”
> resources policy, we noticed that more containers than required were
> being spawned, but with Flink 1.5.2 it “stabilized”. There were
> obviously some containers kept for some time after a job finished (when
> no additional job was submitted to take those resources), but the
> overhead wasn’t big, so we were “all good”.
>
> And here’s the plot twist.
>
> For a couple of days now we’ve been witnessing situations in which
> spawning one job makes Flink request a couple hundred TMs. Additionally,
> in the JM’s logs we can find dozens of lines like:
>
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594295 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594295.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594300 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594300.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594303 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594303.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594304 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594304.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594334 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594334.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594337 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594337.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594152 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594152.
> 2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e96_1538374332137_0793_01_594410 - Remaining pending container requests: 0
> 2018-10-03 11:08:27,187 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e96_1538374332137_0793_01_594410.
>
> The only change made last week seems to be the addition of 5 new nodes
> to the YARN cluster. Any ideas why it’s requesting so many containers?
> Any ideas why there’s this “Received/Returning” flood? Just now one job
> was started and all of a sudden 352 containers were requested from YARN
> (which also nearly exhausted the RAM of our YARN queue).
>
> We’re also experiencing JM hangs (we can’t view the UI and TMs can’t
> communicate with the JM), but first I’d like to resolve the above
> “issue” as it might be the cause of the rest of our problems.
>
> Best regards,
> Borys Gogulski
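
As a rough aid for the debugging step suggested in the reply, the sketch below counts the three container-related messages quoted in this thread and compares the 352 reported requests against what a single job would be expected to need. The function names and the sample list are hypothetical, and the expected TM count assumes Flink's default slot sharing, where a job needs as many slots as its highest operator parallelism; if the job uses custom slot sharing groups the number will differ.

```python
import math
from collections import Counter

# Message fragments copied from the log lines quoted in this thread.
MARKERS = {
    "requested": "Requesting new TaskExecutor container",
    "received": "Received new container:",
    "returned": "Returning excess container",
}


def count_container_events(lines):
    """Count how often each container-related message appears in JM log lines."""
    counts = Counter({name: 0 for name in MARKERS})
    for line in lines:
        for name, marker in MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts


def expected_task_managers(max_parallelism, slots_per_tm):
    """TMs one job needs, ASSUMING default slot sharing (slots = max parallelism)."""
    return math.ceil(max_parallelism / slots_per_tm)


# Sample lines taken from the thread; the resources part is elided in the source.
sample = [
    "Requesting new TaskExecutor container with resources [...]",
    "2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - "
    "Received new container: container_e96_1538374332137_0793_01_594295 - "
    "Remaining pending container requests: 0",
    "2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - "
    "Returning excess container container_e96_1538374332137_0793_01_594295.",
    "2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - "
    "Received new container: container_e96_1538374332137_0793_01_594300 - "
    "Remaining pending container requests: 0",
    "2018-10-03 11:08:27,186 INFO org.apache.flink.yarn.YarnResourceManager - "
    "Returning excess container container_e96_1538374332137_0793_01_594300.",
]

counts = count_container_events(sample)
expected = expected_task_managers(max_parallelism=32, slots_per_tm=1)
excess = 352 - expected  # 352 is the request count reported in the thread

print(dict(counts))
print(f"expected TMs for one job: {expected}, excess requests: {excess}")
```

To run this against a real log, the sample list could be replaced with an open file handle, for example `count_container_events(open("jobmanager.log"))`, where the file name is a placeholder. A "requested" count far above the expected TM count would point at the request side, whereas many "received"/"returned" pairs with few requests would suggest that YARN is still fulfilling older requests.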