Hi,

I am running a streaming job (Flink 1.9) on EMR on YARN. The Flink web UI and the metrics reported to Prometheus both show total memory usage within the configured task manager memory of 3 GB.
The metrics show the following numbers (in MB):

- Heap: 577
- Non-heap: 241
- DirectMemoryUsed: 852

Non-heap rises gradually, starting around 210 MB and reaching 241 MB by the time YARN kills the container. Heap fluctuates between roughly 0.6 GB and 1.x GB, and DirectMemoryUsed stays constant at 852 MB.

Based on the configuration, these are the task manager JVM parameters from the YARN logs:

    -Xms1957m -Xmx1957m -XX:MaxDirectMemorySize=1115m

and these are the other parameters as configured in flink-conf:

- YARN cutoff: 270 MB
- Managed memory: 28 MB
- Network memory: 819 MB

The memory values above are from around the time the container is killed by YARN with:

    <container-xxx> is running beyond physical memory limits.

Is there anything that is not reported by Flink in its metrics, or have I been misinterpreting the numbers? As seen above, the total memory consumed is below 3 GB. I see the same behavior when running the job with 2 GB, 2.7 GB and now 3 GB of task manager memory.

My job does have shuffles: data from one operator is sent to 4 other operators after filtering. One more thing: I am running this with 3 YARN containers (2 tasks in each container), for a total parallelism of 6. As soon as one container fails with this error, the job restarts; however, within minutes the other 2 containers also fail with the same error, one by one.

Thanks,
Hemant
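P.S. To make the arithmetic behind "below 3 GB" explicit, here is my own rough check, using only the numbers quoted above (all values in MB):

    reported usage:  heap 577 + non-heap 241 + direct 852       = 1670
    JVM limits:      -Xmx 1957 + -XX:MaxDirectMemorySize 1115   = 3072  (the full 3 GB container)

So everything Flink reports adds up to well under the container size, which is why the physical-memory kill surprises me.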