[ https://issues.apache.org/jira/browse/FLINK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694177#comment-16694177 ]
zhijiang (zjwang) edited comment on FLINK-10884 at 11/21/18 3:47 AM:
------------------------------------------------------------

I quickly reviewed the related code. My analysis:

In {{ContaineredTaskManagerParameters#create}}, the off-heap size is computed as

{{offHeapSizeMB = containerMemoryMB - heapSizeMB}}

{{containerMemoryMB}} is the container's total physical memory, which includes the {{cutoff}}, whereas {{heapSizeMB}} is calculated without the {{cutoff}}. As a result, {{offHeapSizeMB}} includes the {{cutoff}}.

In {{testOffHeapMemoryWithDefaultConfiguration}}, {{networkBufMB}} is also calculated without the {{cutoff}}, so the {{cutoff}} has to be added to it before it is compared with the {{offHeapSizeMB}} above.

> Flink on yarn TM container will be killed by nodemanager because of the exceeded physical memory.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10884
>                 URL: https://issues.apache.org/jira/browse/FLINK-10884
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, Core
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>         Environment: version: 1.6.2
> module: flink on yarn
> centos, jdk1.8
> hadoop 2.7
>            Reporter: wgcn
>            Assignee: wgcn
>            Priority: Major
>              Labels: yarn
>
> The TM container is killed by the NodeManager because it exceeds its physical memory limit.
> I found that the launch context that launches the TM container uses
> "container memory = heap memory + offHeapSizeMB", in the class
> org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters, lines 160 to 166.
> I set a safety margin on the memory the whole container uses. For example, if the
> container limit is 3g, the sum "heap memory + offHeapSizeMB" is set to 2.4g to prevent
> the container from being killed. Is there a ready-made solution, or can I commit my solution?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
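
The arithmetic discussed in the comment above can be sketched as follows. This is only an illustration, not the actual Flink code: the class CutoffSketch and its helper methods are hypothetical, the cutoff ratio (0.25), minimum cutoff (600 MB) and network buffer size (256 MB) are assumed values, and the heap is simplified to "memory after cutoff minus network buffers". It shows why offHeapSizeMB ends up containing the cutoff, and therefore why networkBufMB alone cannot be compared with it.

```java
// Hypothetical sketch of the memory arithmetic; not the Flink implementation.
class CutoffSketch {

    // Cutoff reserved from the container: a ratio of the total, but at least a minimum.
    static long cutoffMB(long containerMemoryMB, double cutoffRatio, long minCutoffMB) {
        return Math.max((long) (containerMemoryMB * cutoffRatio), minCutoffMB);
    }

    // Simplifying assumption: the heap is what remains after cutoff and network buffers.
    static long heapSizeMB(long containerMemoryMB, long cutoffMB, long networkBufMB) {
        return containerMemoryMB - cutoffMB - networkBufMB;
    }

    // The formula quoted in the comment: offHeapSizeMB = containerMemoryMB - heapSizeMB.
    static long offHeapSizeMB(long containerMemoryMB, long heapSizeMB) {
        return containerMemoryMB - heapSizeMB;
    }

    public static void main(String[] args) {
        long containerMemoryMB = 3072;   // the 3g container from the issue description
        double cutoffRatio = 0.25;       // assumed value
        long minCutoffMB = 600;          // assumed value
        long networkBufMB = 256;         // assumed value

        long cutoff = cutoffMB(containerMemoryMB, cutoffRatio, minCutoffMB);
        long heap = heapSizeMB(containerMemoryMB, cutoff, networkBufMB);
        long offHeap = offHeapSizeMB(containerMemoryMB, heap);

        // heap was computed without the cutoff, so offHeap absorbs it:
        // offHeap == networkBufMB + cutoff
        System.out.println("cutoff=" + cutoff + " heap=" + heap + " offHeap=" + offHeap);
        System.out.println("offHeap == networkBuf + cutoff: "
                + (offHeap == networkBufMB + cutoff));
    }
}
```

Under these assumptions a test that compares offHeapSizeMB against networkBufMB alone would be off by exactly the cutoff, which matches the adjustment suggested in the comment.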