Hi Folks,

I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few
hours after I start a streaming job (built with the Kafka connector
0.10_2.11), it gets killed for no apparent reason. After inspecting the
logs, my best guess is that YARN is killing containers because of high
virtual memory usage.

Any guesses as to why this might be happening, or tips on what I should be
looking for?

Next, I'll enable taskmanager.debug.memory.startLogThread to keep
investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz
<https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz>
on YARN, but my job uses Scala 2.11 dependencies, so I'll try
flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz
<https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz>
instead.


   - Flink logs:

2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Container ResourceID{resourceId='container_1481732559439_0002_01_000004'} failed. Exit status: 1
2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Diagnostics for container ResourceID{resourceId='container_1481732559439_0002_01_000004'} in state COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_1481732559439_0002_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
        at org.apache.hadoop.util.Shell.run(Shell.java:456)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1



   - YARN logs:

container_1481732559439_0002_01_000004: 2.6 GB of 5 GB physical memory used; 38.1 GB of 10.5 GB virtual memory used
2016-12-15 17:44:03,119 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 62223 for container-id container_1481732559439_0002_01_000001: 656.3 MB of 2 GB physical memory used; 3.2 GB of 4.2 GB virtual memory used
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1481732559439_0002_01_000004 is : 1
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1481732559439_0002_01_000004 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
        at org.apache.hadoop.util.Shell.run(Shell.java:456)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
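
To make the reading of the YARN memory lines above concrete, here is a
minimal Python sketch that pulls the physical and virtual memory figures out
of the NodeManager messages and flags containers above their virtual memory
limit. The helper names (parse_usage, over_vmem_limit) are made up for
illustration, and the regular expression assumes only the message format seen
in the two lines quoted above. One thing the numbers suggest: 10.5 GB / 5 GB
and 4.2 GB / 2 GB both equal 2.1, which matches the Hadoop default for
yarn.nodemanager.vmem-pmem-ratio, so the limits look like that default ratio
applied to the container size. That is an inference from the figures, not
something the logs state.

```python
import re

# Memory figures as they appear in the NodeManager lines quoted above.
# Assumption: other Hadoop versions may word this message differently.
USAGE_RE = re.compile(
    r"(?P<container>container_\w+): "
    r"(?P<pmem_used>[\d.]+) (?P<pmem_used_unit>[KMG]B) of "
    r"(?P<pmem_limit>[\d.]+) (?P<pmem_limit_unit>[KMG]B) physical memory used; "
    r"(?P<vmem_used>[\d.]+) (?P<vmem_used_unit>[KMG]B) of "
    r"(?P<vmem_limit>[\d.]+) (?P<vmem_limit_unit>[KMG]B) virtual memory used"
)

UNIT_TO_GB = {"KB": 1 / 1024 ** 2, "MB": 1 / 1024, "GB": 1.0}

FIELDS = ("pmem_used", "pmem_limit", "vmem_used", "vmem_limit")


def parse_usage(text):
    """Return one dict per memory-usage message found in `text`, in GB."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    usages = []
    for match in USAGE_RE.finditer(flat):
        usage = {"container": match.group("container")}
        for field in FIELDS:
            value = float(match.group(field))
            unit = match.group(field + "_unit")
            usage[field + "_gb"] = value * UNIT_TO_GB[unit]
        usages.append(usage)
    return usages


def over_vmem_limit(usage):
    """True if the container uses more virtual memory than its limit."""
    return usage["vmem_used_gb"] > usage["vmem_limit_gb"]


# The two memory lines from the YARN log in this message.
SAMPLE = """
container_1481732559439_0002_01_000004: 2.6 GB of 5 GB physical memory
used; 38.1 GB of 10.5 GB virtual memory used
Memory usage of ProcessTree 62223 for container-id
container_1481732559439_0002_01_000001: 656.3 MB of 2 GB physical
memory used; 3.2 GB of 4.2 GB virtual memory used
"""

if __name__ == "__main__":
    for usage in parse_usage(SAMPLE):
        ratio = usage["vmem_limit_gb"] / usage["pmem_limit_gb"]
        status = "OVER vmem limit" if over_vmem_limit(usage) else "within limit"
        print(f"{usage['container']}: limit ratio {ratio:.2f}, {status}")
```

Run against a full NodeManager log, this would show whether the same
container was already over its virtual memory limit well before it exited,
which helps separate a vmem kill from an unrelated failure.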


Best regards,
Paulo Cezar
