[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230365#comment-17230365 ]
Xintong Song commented on FLINK-15906:
--------------------------------------

Hi [~清月],

If the problem does not happen frequently, I would suggest first trying to configure a larger JVM overhead memory size. The configuration options are `taskmanager.memory.jvm-overhead.[min|max|fraction]`. See https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup.html#capped-fractionated-components
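As a rough sketch (the concrete numbers below are assumptions for illustration, not values taken from this ticket), pinning the overhead to a larger explicit size in flink-conf.yaml could look like this:

{code}
# Hypothetical flink-conf.yaml snippet: reserve more room for untracked native
# memory (thread stacks, code cache, GC space, glibc arenas, mmapped files, ...).
# Setting min == max pins the overhead to an explicit size; the values are
# examples only and should be tuned to the workload.
taskmanager.memory.jvm-overhead.min: 1g
taskmanager.memory.jvm-overhead.max: 1g
{code}

For context, the explicit sizes quoted in the description below already add up to 640 + 128 + 1408 + 128 + 128 + 1408 + 256 = 4096 MB, i.e. the full 4 GB container, so everything outside those budgets has to fit into the 640 MB overhead allowance. The container was killed for exceeding the limit by only about 44 MB (46342144 B), which is why enlarging the overhead (and thereby the requested container size) is a reasonable first step.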
> physical memory exceeded causing being killed by yarn
> -----------------------------------------------------
>
>                 Key: FLINK-15906
>                 URL: https://issues.apache.org/jira/browse/FLINK-15906
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>            Reporter: liupengcheng
>            Priority: Major
>
> Recently, we encountered this issue when testing a TPC-DS query with 100 GB of data.
> I first hit this issue when I only set `taskmanager.memory.total-process.size` to `4g` via the `-tm` option. Then I tried to increase the JVM overhead size with the following settings, but it still failed.
> {code:java}
> taskmanager.memory.jvm-overhead.min: 640m
> taskmanager.memory.jvm-metaspace: 128m
> taskmanager.memory.task.heap.size: 1408m
> taskmanager.memory.framework.heap.size: 128m
> taskmanager.memory.framework.off-heap.size: 128m
> taskmanager.memory.managed.size: 1408m
> taskmanager.memory.shuffle.max: 256m
> {code}
> {code:java}
> java.lang.Exception: [2020-02-05 11:31:32.345] Container [pid=101677,containerID=container_e08_1578903621081_4785_01_000051] is running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing container.
> Dump of the process-tree for container_e08_1578903621081_4785_01_000051 :
>  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>  |- 101938 101677 101677 101677 (java) 25762 3571 18867417088 1059157 /opt/soft/openjdk1.8.0/bin/java -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.shuffle.max=268435456b -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1476395008b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.shuffle.min=268435456b --configDir . -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b -Drest.address=zjy-hadoop-prc-st2805.bj
>  |- 101677 101671 101677 101677 (bash) 1 1 118030336 733 /bin/bash -c /opt/soft/openjdk1.8.0/bin/java -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.shuffle.max=268435456b -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1476395008b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.shuffle.min=268435456b --configDir . -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b -Drest.address=zjy-hadoop-prc-st2805.bj 1> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.out 2> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.err
> {code}
> I suspect there is a leak or some unexpected off-heap memory usage.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)