Also, can you tell us what OS you are running on?

On Fri, Dec 16, 2016 at 6:23 PM, Stephan Ewen <se...@apache.org> wrote:
> Hi!
>
> To diagnose this a little better, can you help us with the following info:
>
> - Are you using RocksDB?
> - What is your Flink configuration, especially around memory settings?
> - What do you use for the TaskManager heap size? A manual value, or do
>   you let Flink/YARN set it automatically based on the container size?
> - Do you use any libraries or connectors in your program?
>
> Greetings,
> Stephan
>
>
> On Fri, Dec 16, 2016 at 5:47 PM, Paulo Cezar <paulo.ce...@gogeo.io> wrote:
>
>> Hi Folks,
>>
>> I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few
>> hours after I start a streaming job (built with the Kafka 0.10 connector,
>> Scala 2.11), it gets killed seemingly for no reason. After inspecting the
>> logs, my best guess is that YARN is killing containers due to high
>> virtual memory usage.
>>
>> Any guesses on why this might be happening, or tips on what I should be
>> looking for?
>>
>> What I'll do next is enable taskmanager.debug.memory.startLogThread to
>> keep investigating. Also, I was deploying
>> flink-1.2-SNAPSHOT-bin-hadoop2.tgz
>> <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz>
>> on YARN, but my job uses Scala 2.11 dependencies, so I'll try
>> flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz
>> <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz>
>> instead.
>>
>> - Flink logs:
>>
>> 2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor
>>     - Association with remote system [akka.tcp://flink@10.0.0.8:49832]
>>     has failed, address is now gated for [5000] ms.
>>     Reason is: [Disassociated].
>> 2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager
>>     - Container ResourceID{resourceId='container_1481732559439_0002_01_000004'}
>>     failed. Exit status: 1
>> 2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager
>>     - Diagnostics for container
>>     ResourceID{resourceId='container_1481732559439_0002_01_000004'}
>>     in state COMPLETE: exitStatus=1 diagnostics=Exception from
>>     container-launch.
>> Container id: container_1481732559439_0002_01_000004
>> Exit code: 1
>> Stack trace: ExitCodeException exitCode=1:
>>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>>     at org.apache.hadoop.util.Shell.run(Shell.java:456)
>>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Container exited with a non-zero exit code 1
>>
>> - YARN logs:
>>
>> container_1481732559439_0002_01_000004: 2.6 GB of 5 GB physical memory
>>     used; 38.1 GB of 10.5 GB virtual memory used
>> 2016-12-15 17:44:03,119 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>     Memory usage of ProcessTree 62223 for container-id
>>     container_1481732559439_0002_01_000001: 656.3 MB of 2 GB physical
>>     memory used; 3.2 GB of 4.2 GB virtual memory used
>> 2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
>>     Exit code from container container_1481732559439_0002_01_000004 is : 1
>> 2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
>>     Exception from container-launch with container ID:
>>     container_1481732559439_0002_01_000004 and exit code: 1
>> ExitCodeException exitCode=1:
>>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>>     at org.apache.hadoop.util.Shell.run(Shell.java:456)
>>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Best regards,
>> Paulo Cezar
>>
>
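[Editor's note: the YARN log line "38.1 GB of 10.5 GB virtual memory used" is consistent with the NodeManager's virtual-memory check killing the container: by default a container may use up to `yarn.nodemanager.vmem-pmem-ratio` (2.1) times its requested physical memory, and 5 GB × 2.1 = 10.5 GB matches the limit in the log. If the physical footprint is healthy (here only 2.6 GB of 5 GB), a common mitigation is to relax or disable that check. The sketch below uses stock Hadoop 2.7 property names; the ratio value of 8 is an illustrative choice, not a recommendation, and would need tuning for the cluster in question.]

```xml
<!-- yarn-site.xml on each NodeManager (restart NodeManagers afterwards).
     Apply ONE of the two options below. -->
<configuration>
  <!-- Option A: disable the virtual-memory check entirely.
       The physical-memory (pmem) check remains active. -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>

  <!-- Option B: keep the check but allow more virtual memory per unit
       of requested physical memory (default 2.1; with a 5 GB container
       a ratio of 8 would permit 40 GB of virtual memory). -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>8</value>
  </property>
</configuration>
```

To gather the data Stephan asked about, the `taskmanager.debug.memory.startLogThread: true` setting mentioned in the thread can be added to flink-conf.yaml so the TaskManagers periodically log heap, non-heap, and direct-memory usage alongside the NodeManager's ProcessTree reports.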