Hi, I am running a simple Spark Streaming application on Hadoop 2.7.0/YARN (master: yarn-client) with 2 executors on different machines. However, while the app is running, the Executors tab of the application web UI shows that only one executor keeps completing tasks over time; the other executor works and completes tasks for only a few seconds. The logs show an exception, but it is not clear what went wrong.
Here is the yarn-nodemanager log:
«
2015-06-17 00:29:50,967 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1434391147618_0007_01_000003
2015-06-17 00:29:50,977 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 286.5 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:29:53,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 463.7 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:29:57,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 465.7 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:00,024 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 467.6 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:03,032 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 474.0 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:06,041 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 480.2 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:09,053 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 540.9 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:12,068 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 550.9 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:15,075 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 551.1 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:18,090 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30553 for container-id container_1434391147618_0007_01_000003: 558.7 MB of 3 GB physical memory used; 2.7 GB of 6.3 GB virtual memory used
2015-06-17 00:30:20,157 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1434391147618_0007_01_000003 is : 1
2015-06-17 00:30:20,157 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1434391147618_0007_01_000003 and exit code: 1
ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1434391147618_0007_01_000003
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=1:
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.run(Shell.java:456)
2015-06-17 00:30:20,157 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.FutureTask.run(FutureTask.java:262)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.lang.Thread.run(Thread.java:745)
2015-06-17 00:30:20,158 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1434391147618_0007_01_000003 transitioned from RUNNING to EXITED_WITH_FAILURE
2015-06-17 00:30:20,158 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1434391147618_0007_01_000003
2015-06-17 00:30:20,178 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-myuser/nm-local-dir/usercache/myuser/appcache/application_1434391147618_0007/container_1434391147618_0007_01_000003
2015-06-17 00:30:20,178 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=myuser OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1434391147618_0007 CONTAINERID=container_1434391147618_0007_01_000003
2015-06-17 00:30:20,178 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1434391147618_0007_01_000003 transitioned from EXITED_WITH_FAILURE to DONE
2015-06-17 00:30:20,179 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1434391147618_0007_01_000003 from application application_1434391147618_0007
2015-06-17 00:30:20,179 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1434391147618_0007
2015-06-17 00:30:20,500 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1434391147618_0007 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP
2015-06-17 00:30:20,501 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-myuser/nm-local-dir/usercache/myuser/appcache/application_1434391147618_0007
2015-06-17 00:30:20,501 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1434391147618_0007
2015-06-17 00:30:20,501 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1434391147618_0007 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2015-06-17 00:30:20,501 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1434391147618_0007, with delay of 10800 seconds
»

Not sure if it is relevant, but in the output of the application I keep getting this message:
«15/06/17 00:29:53 INFO ShuffledDStream: Time 1434497393000 ms is invalid as zeroTime is 1434497391000 ms and slideDuration is 4000 ms and difference is 2000 ms»

I'm using Spark 1.3.2. Any idea what might be happening? Thanks.