Hi folks,

I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few hours after I start a streaming job (built with the Kafka 0.10 connector, flink-connector-kafka-0.10_2.11), it gets killed for no apparent reason. After inspecting the logs, my best guess is that YARN is killing the containers because of high virtual memory usage.
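One thing that points that way: the 10.5 GB virtual memory limit in the NodeManager log below is exactly the 5 GB container size times 2.1, which I believe is the default yarn.nodemanager.vmem-pmem-ratio, so I assume the vmem check is running with default settings. For reference, these are the yarn-site.xml properties that, as far as I understand, control that check (values shown are the defaults; I haven't overridden anything on my cluster):

    <!-- yarn-site.xml: defaults as I understand them, nothing overridden here -->
    <property>
      <!-- if set to false, the NodeManager stops killing containers that exceed their virtual memory limit -->
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>true</value>
    </property>
    <property>
      <!-- vmem limit = container physical memory * this ratio (5 GB * 2.1 = 10.5 GB, matching the log below) -->
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>2.1</value>
    </property>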
Any guesses on why this might be happening, or tips on what I should be looking for? My next step is to enable taskmanager.debug.memory.startLogThread to keep investigating (the exact flink-conf.yaml snippet I plan to use is in the PS below). Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz> on YARN, but my job uses Scala 2.11 dependencies, so I'll try flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz> instead.

- Flink logs:

2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Container ResourceID{resourceId='container_1481732559439_0002_01_000004'} failed. Exit status: 1
2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container ResourceID{resourceId='container_1481732559439_0002_01_000004'} in state COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_1481732559439_0002_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

- YARN logs:

container_1481732559439_0002_01_000004: 2.6 GB of 5 GB physical memory used; 38.1 GB of 10.5 GB virtual memory used
2016-12-15 17:44:03,119 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 62223 for container-id container_1481732559439_0002_01_000001: 656.3 MB of 2 GB physical memory used; 3.2 GB of 4.2 GB virtual memory used
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1481732559439_0002_01_000004 is : 1
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1481732559439_0002_01_000004 and exit code: 1
ExitCodeException exitCode=1:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Best regards,
Paulo Cezar
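
PS: In case it helps, this is roughly what I plan to add to flink-conf.yaml to turn on the memory log thread (the interval key and the 5000 ms value are just what I took from the configuration docs, not something I've verified yet):

    # periodically log heap / non-heap / direct memory usage of each TaskManager
    taskmanager.debug.memory.startLogThread: true
    # how often (in milliseconds) the memory usage is logged
    taskmanager.debug.memory.logIntervalMs: 5000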