The node manager log looks like this. I'm not exactly sure what it means, but the container monitor messages seem to indicate there is still plenty of memory.
2014-09-09 21:47:00,718 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 319.7 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:03,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 321.2 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:06,736 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 321.5 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:09,743 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 321.5 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:12,755 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 321.9 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:15,762 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 322.8 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:18,430 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 6 cluster_timestamp: 1410224367331 } attemptId: 1 } id: 3 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-09-09 21:47:18,769 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 322.8 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:21,777 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 322.8 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:24,784 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26819 for container-id container_1410224367331_0006_01_000003: 324.1 MB of 1.5 GB physical memory used; 1.7 GB of 7.5 GB virtual memory used
2014-09-09 21:47:26,345 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1410224367331_0006_01_000003 is : 1
2014-09-09 21:47:26,345 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1410224367331_0006_01_000003 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2014-09-09 21:47:26,345 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2014-09-09 21:47:26,345 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
2014-09-09 21:47:26,346 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410224367331_0006_01_000003 transitioned from RUNNING to EXITED_WITH_FAILURE
2014-09-09 21:47:26,346 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1410224367331_0006_01_000003
2014-09-09 21:47:26,371 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1410224367331_0006/container_1410224367331_0006_01_000003
2014-09-09 21:47:26,371 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1410224367331_0006 CONTAINERID=container_1410224367331_0006_01_000003
2014-09-09 21:47:26,372 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410224367331_0006_01_000003 transitioned from EXITED_WITH_FAILURE to DONE
2014-09-09 21:47:26,372 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1410224367331_0006_01_000003 from application application_1410224367331_0006

On Sep 9, 2014, at 2:10 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

This has all the symptoms of Yarn killing your executors due to them exceeding their memory limits. Could you check your RM/NM logs to see if that's the case? (The error was because of an executor at domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's log file.)

If that's the case, you'll need to play with the "spark.yarn.executor.memoryOverhead" config (see http://spark.apache.org/docs/latest/running-on-yarn.html).
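For anyone searching the archives later: spark.yarn.executor.memoryOverhead is an ordinary Spark property, so (on a Spark build that supports it) it can be raised on the driver's SparkConf before the StreamingContext is created, or in conf/spark-defaults.conf. A minimal sketch; the 512 MB value is only a placeholder to tune, not a known-good number:

    import org.apache.spark.SparkConf

    // Hypothetical: reserve an extra 512 MB per executor container for JVM/off-heap overhead.
    // Increase the value until YARN stops killing the executors for exceeding their memory limit.
    val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "512")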
On Tue, Sep 9, 2014 at 1:13 PM, Penny Espinoza <pesp...@societyconsulting.com> wrote:

Hey - I have a Spark 1.0.2 job (using spark-streaming-kafka) that runs successfully using master = local[4]. However, when I run it on a Hadoop 2.2 EMR cluster using master yarn-client, it fails after running for about 5 minutes.

My main method does something like this (a rough code sketch follows below):
- gets a streaming context
- creates a DStream from KafkaUtils.createStream (let's call this a)
- creates another DStream from a.flatMap (let's call this b)
- creates another DStream from b.updateStateByKey (c)
- creates another DStream from c.flatMap (d)
- runs d.foreachRDD, and uses Tuplejump's Calliope cql3SaveToCassandra to save data to Cassandra

Here's the error message:

14/09/09 20:06:04 WARN scheduler.TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block input-0-1410293154000 not found
        at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

According to the logs, the block referenced in the error message was added to memory only about 10 seconds before this error appears, and there is no evidence in the log of any blocks being cleared from memory, or of it running out of memory.
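For readers following along, the pipeline described above has roughly this shape. This is a reconstruction from the description only: the batch interval, Kafka parameters, the parsing and state functions, and the stubbed-out save are placeholders; just the operator chain (createStream -> flatMap -> updateStateByKey -> flatMap -> foreachRDD), the app name, and the checkpoint directory come from the thread.

    // Rough sketch only: the real parsing, state, and Cassandra-save logic is not shown in the thread.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits (needed on 1.0.x)
    import org.apache.spark.streaming.kafka.KafkaUtils

    object StreamingJobSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("foo")        // app name taken from the posted conf
        val ssc  = new StreamingContext(conf, Seconds(10))  // batch interval is a guess
        // updateStateByKey requires a checkpoint dir; the path is taken from the posted logs
        ssc.checkpoint("hdfs://10.240.49.124:9000/user/hadoop/checkpoint")

        // (a) raw (key, message) pairs from Kafka; ZK quorum, group, and topic are placeholders
        val a = KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group", Map("topic" -> 1))

        // (b) parse each message into zero or more (key, count) records -- placeholder logic
        val b = a.flatMap { case (_, msg) => msg.split("\\s+").map(w => (w, 1L)) }

        // (c) running state per key
        val updateState: (Seq[Long], Option[Long]) => Option[Long] =
          (values, state) => Some(values.sum + state.getOrElse(0L))
        val c = b.updateStateByKey(updateState)

        // (d) flatten the updated state back into output rows
        val d = c.flatMap { case (key, total) => Seq((key, total)) }

        // Save each batch; the original job calls Calliope's cql3SaveToCassandra here instead
        d.foreachRDD { rdd =>
          rdd.foreachPartition(_ => ())  // placeholder for the Cassandra write
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

The failing task belongs to the flatMap stage feeding that final save, which is why the "Could not compute split" error points at a receiver block (input-0-*) rather than at the Cassandra write itself.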
Below is a snippet of the log:

14/09/09 20:05:54 INFO storage.BlockManagerInfo: Added input-0-1410293154000 in memory on domU-12-31-39-07-B0-11.compute-1.internal:55682 (size: 214.4 KB, free: 490.8 MB)
... (several other messages like the one above, but for different blocks)
14/09/09 20:05:58 INFO storage.BlockManagerInfo: Added input-0-1410293158000 in memory on domU-12-31-39-07-B0-11.compute-1.internal:55682 (size: 3.1 MB, free: 457.3 MB)
14/09/09 20:06:00 INFO scheduler.ReceiverTracker: Stream 0 received 21 blocks
14/09/09 20:06:00 INFO dstream.StateDStream: Persisting RDD 57 for time 1410293160000 ms to StorageLevel(false, true, false, false, 1) at time 1410293160000 ms
14/09/09 20:06:00 INFO dstream.StateDStream: Marking RDD 57 for time 1410293160000 ms for checkpointing at time 1410293160000 ms
14/09/09 20:06:00 INFO scheduler.JobScheduler: Added jobs for time 1410293160000 ms
14/09/09 20:06:00 INFO scheduler.JobGenerator: Checkpointing graph for time 1410293160000 ms
14/09/09 20:06:00 INFO streaming.DStreamGraph: Updating checkpoint data for time 1410293160000 ms
14/09/09 20:06:00 INFO streaming.DStreamGraph: Updated checkpoint data for time 1410293160000 ms
14/09/09 20:06:00 INFO scheduler.JobScheduler: Starting job streaming job 1410293160000 ms.0 from job set of time 1410293160000 ms
14/09/09 20:06:00 INFO streaming.CheckpointWriter: Saving checkpoint for time 1410293160000 ms to file 'hdfs://10.240.49.124:9000/user/hadoop/checkpoint/checkpoint-1410293160000'
14/09/09 20:06:00 INFO spark.SparkContext: Starting job: saveAsNewAPIHadoopFile at CassandraRDDFunctions.scala:203
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Registering RDD 54 (flatMap at FlatMappedDStream.scala:35)
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Got job 12 (saveAsNewAPIHadoopFile at CassandraRDDFunctions.scala:203) with 2 output partitions (allowLocal=false)
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Final stage: Stage 23(saveAsNewAPIHadoopFile at CassandraRDDFunctions.scala:203)
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 24)
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Missing parents: List(Stage 24)
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Submitting Stage 24 (FlatMappedRDD[54] at flatMap at FlatMappedDStream.scala:35), which has no missing parents
14/09/09 20:06:00 INFO streaming.CheckpointWriter: Deleting hdfs://10.240.49.124:9000/user/hadoop/checkpoint/checkpoint-1410293110000.bk
14/09/09 20:06:00 INFO streaming.CheckpointWriter: Checkpoint for time 1410293160000 ms saved to file 'hdfs://10.240.49.124:9000/user/hadoop/checkpoint/checkpoint-1410293160000', took 4351 bytes and 55 ms
14/09/09 20:06:00 INFO streaming.DStreamGraph: Clearing checkpoint data for time 1410293160000 ms
14/09/09 20:06:00 INFO streaming.DStreamGraph: Cleared checkpoint data for time 1410293160000 ms
14/09/09 20:06:00 INFO scheduler.DAGScheduler: Submitting 21 missing tasks from Stage 24 (FlatMappedRDD[54] at flatMap at FlatMappedDStream.scala:35)
14/09/09 20:06:00 INFO cluster.YarnClientClusterScheduler: Adding task set 24.0 with 21 tasks
14/09/09 20:06:04 INFO scheduler.TaskSetManager: Starting task 24.0:0 as TID 91 on executor 2: domU-12-31-39-0B-F1-D1.compute-1.internal (RACK_LOCAL)
14/09/09 20:06:04 INFO scheduler.TaskSetManager: Serialized task 24.0:0 as 1437 bytes in 0 ms
14/09/09 20:06:04 INFO scheduler.TaskSetManager: Starting task 24.0:1 as TID 92 on executor 2: domU-12-31-39-0B-F1-D1.compute-1.internal (RACK_LOCAL)
14/09/09 20:06:04 INFO scheduler.TaskSetManager: Serialized task 24.0:1 as 1437 bytes in 0 ms
14/09/09 20:06:04 WARN scheduler.TaskSetManager: Lost TID 91 (task 24.0:0)
14/09/09 20:06:04 WARN scheduler.TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block input-0-1410293154000 not found
        at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I have seen earlier threads recommend getting around this problem by setting spark.streaming.unpersist to false, but that had no effect on the error. I have also tried making some changes to shuffle-related settings. My conf from logs is below.

14/09/09 20:04:32 INFO spark.SparkContext: Spark configuration:
spark.app.name=foo
spark.cleaner.ttl=86400
spark.home=/home/hadoop/spark
spark.jars=file:/home/hadoop/rna/rna-spark-streaming-assembly-1.0-SNAPSHOT.jar
spark.logConf=true
spark.master=yarn-client
spark.shuffle.consolidateFiles=true
spark.shuffle.memoryFraction=0.4
spark.storage.memoryFraction=0.5
spark.streaming.unpersist=false
spark.yarn.dist.archives=file:/home/hadoop/rna/lib/jackson-core-asl-1.8.9.jar,file:/home/hadoop/rna/lib/jackson-mapper-asl-1.8.9.jar

Any idea of what might be happening to this block and how I might eliminate this error?

thanks
p

--
Marcelo
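For completeness, the non-default entries in that conf dump correspond to driver-side settings along these lines; this is only a sketch of how such a conf is typically assembled, not the poster's actual code:

    import org.apache.spark.SparkConf

    // Values mirror the conf dump above; master, jars, and dist.archives come from spark-submit.
    val sparkConf = new SparkConf()
      .setAppName("foo")
      .set("spark.cleaner.ttl", "86400")
      .set("spark.shuffle.consolidateFiles", "true")
      .set("spark.shuffle.memoryFraction", "0.4")
      .set("spark.storage.memoryFraction", "0.5")
      .set("spark.streaming.unpersist", "false")   // the workaround tried above; had no effect here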