Hi Borys,

If possible, the complete JM logs (at DEBUG log level) would be helpful to
further debug the problem. Have there been any recovery operations lately?

Cheers,
Till

On Sat, Oct 6, 2018 at 11:15 AM Gary Yao <g...@data-artisans.com> wrote:

> Hi Borys,
>
> To debug how many containers Flink is requesting, you can look out for the
> log
> statement below [1]:
>
>     Requesting new TaskExecutor container with resources [...]
>
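> If it helps, here is a minimal, hypothetical Java helper (not part of
> Flink; the class name is made up and it simply matches the log messages
> quoted in this thread) to tally those lines in a JM log file:
>
>     // CountContainerRequests.java -- illustrative sketch only.
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>
>     public class CountContainerRequests {
>         public static void main(String[] args) throws Exception {
>             long requested = 0, received = 0, excess = 0;
>             // args[0]: path to the JobManager log file
>             for (String line : Files.readAllLines(Paths.get(args[0]))) {
>                 if (line.contains("Requesting new TaskExecutor container")) requested++;
>                 if (line.contains("Received new container")) received++;
>                 if (line.contains("Returning excess container")) excess++;
>             }
>             System.out.println("requested=" + requested
>                     + " received=" + received + " excess=" + excess);
>         }
>     }
>
> Comparing the three counts should give a first hint whether Flink is asking
> for too many containers or whether YARN is handing back more than requested.
>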
> If you need help debugging, can you attach the full JM logs (preferably on
> DEBUG level)? Would it be possible for you to test against 1.5.3 and 1.5.4?
> However, I am not aware of any related issues that were fixed for 1.5.3 or
> 1.5.4. What is the Hadoop distribution that you are using?
>
> Best,
> Gary
>
> [1]
> https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L454
>
> On Wed, Oct 3, 2018 at 11:36 AM Borys Gogulski <borys.gogul...@mapp.com>
> wrote:
>
>> Hey,
>>
>>
>>
>> We’re running Flink 1.5.2 (I know there’s 1.5.4 and 1.6.1) on YARN for
>> some jobs we’re processing. It’s a “long running” container to which we’re
>> submitting jobs. All jobs submitted to that container have a parallelism of
>> 32 (to be precise: in this job there are 8 subtasks with parallelism 32 and
>> one subtask with parallelism 1), and we run at most 8 of them at a time.
>> TMs are configured with one slot only and 6 GB RAM each.
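>> To make the slot math concrete, here is a simplified stand-in for such a
>> job (not our actual code; the class name and the toy pipeline are made up):
>> default parallelism 32 plus a single parallelism-1 operator. With the
>> default slot sharing, each running job needs up to 32 slots, i.e. 32
>> single-slot TMs, so 8 concurrent jobs come to roughly 256 containers.
>>
>>     // ParallelismSketch.java -- simplified illustration only.
>>     import org.apache.flink.api.common.functions.MapFunction;
>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>>     public class ParallelismSketch {
>>         public static void main(String[] args) throws Exception {
>>             StreamExecutionEnvironment env =
>>                     StreamExecutionEnvironment.getExecutionEnvironment();
>>             // Default parallelism; in our real job 8 subtasks run at it.
>>             env.setParallelism(32);
>>             env.fromElements(1, 2, 3)
>>                     .map(new MapFunction<Integer, Integer>() {
>>                         @Override
>>                         public Integer map(Integer value) {
>>                             return value * 2;
>>                         }
>>                     })
>>                     .print()
>>                     .setParallelism(1); // the single parallelism-1 subtask
>>             env.execute("parallelism sketch");
>>         }
>>     }
>>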
>> In the beginning, when using Flink 1.5.0 and 1.5.1 with the “on-demand”
>> resource allocation, we noticed that more containers than required were
>> being spawned, but with Flink 1.5.2 it “stabilized” – obviously some
>> containers were kept around for a while after a job finished (when no
>> additional job was submitted to take those resources), but the overhead
>> wasn’t big, so we were “all good”.
>> And here’s the plot twist.
>> For a couple of days now we’ve been seeing situations in which submitting
>> one job makes Flink request a couple of hundred TMs. Additionally, in the
>> JM’s logs we can find dozens of lines like:
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594295 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594295.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594300 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594300.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594303 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594303.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594304 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594304.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594334 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594334.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594337 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594337.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594152 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594152.
>>
>> 2018-10-03 11:08:27,186 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Received
>> new container: container_e96_1538374332137_0793_01_594410 - Remaining
>> pending container requests: 0
>>
>> 2018-10-03 11:08:27,187 INFO
>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>> excess container container_e96_1538374332137_0793_01_594410.
>>
>> The only change made last week seems to be the addition of 5 new nodes to
>> the YARN cluster.
>> Any idea why it’s requesting so many containers? And why there’s this
>> “Received/Returning” flood? Just now a single job was started and all of a
>> sudden 352 containers were requested from YARN (also almost exhausting the
>> RAM of YARN’s queue).
>>
>>
>>
>> We’re also experiencing JM hangs (we can’t view the UI and TMs can’t
>> communicate with the JM), but first I’d like to resolve the above “issue”,
>> as it might be the cause of the rest of our problems.
>>
>>
>>
>> Best regards,
>> Borys Gogulski
>>
>
