On the worker/container that fails, the "file not found" is the first error; the output below is from the YARN log. There were some Python worker crashes for another job/stage earlier (see the warning at 18:36), but I expect those to be unrelated to this file-not-found error.
==================================================================================
LogType:stderr
Log Upload Time:15-May-2015 18:50:05
LogLength:5706
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/filecache/89/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting to kill Python Worker
15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0 (TID 995)
java.io.FileNotFoundException: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489/03/shuffle_4_319_0.data (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local spark dir: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
java.io.IOException: Failed to delete: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:933)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:162)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
    at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:156)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1208)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:88)
    at org.apache.spark.executor.Executor.stop(Executor.scala:146)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:105)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:38)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

On Tue, May 19, 2015 at 3:38 AM, Imran Rashid <[email protected]> wrote:

> Hi,
>
> can you take a look at the logs and see what the first error you are
> getting is? Its possible that the file doesn't exist when that error is
> produced, but it shows up later -- I've seen similar things happen, but
> only after there have already been some errors. But, if you see that in
> the very first error, then I'm not sure what the cause is. Would be
> helpful for you to send the logs.
>
> Imran
>
> On Fri, May 15, 2015 at 10:07 AM, rok <[email protected]> wrote:
>
>> I am trying to sort a collection of key,value pairs (between several hundred
>> million to a few billion) and have recently been getting lots of
>> "FetchFailedException" errors that seem to originate when one of the
>> executors doesn't seem to find a temporary shuffle file on disk. E.g.:
>>
>> org.apache.spark.shuffle.FetchFailedException:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>> (No such file or directory)
>>
>> This file actually exists:
>>
>> > ls -l
>> > /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>
>> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>
>> This error repeats on several executors and is followed by a number of
>>
>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>> location for shuffle 0
>>
>> This results on most tasks being lost and executors dying.
>>
>> There is plenty of space on all of the appropriate filesystems, so none of
>> the executors are running out of disk space. Any idea what might be causing
>> this? I am running this via YARN on approximately 100 nodes with 2 cores per
>> node. Any thoughts on what might be causing these errors? Thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailedException-and-MetadataFetchFailedException-tp22901.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
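
For anyone wanting to check "what is the first error" on other containers, here is a minimal sketch of how that could be scripted. It is not part of Spark or YARN. It assumes the log text has one entry per line in the "yy/MM/dd HH:mm:ss LEVEL Logger: message" layout seen in the output above, and the names `ENTRY`, `LogEntry` and `first_error` are made up for illustration. The sample text reuses lines from that output, with the long paths abbreviated to "...".

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, taken from the excerpt above:
#   15/05/15 18:50:03 ERROR Executor: Exception in task ...
ENTRY = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<logger>[^:\s]+): (?P<msg>.*)$",
    re.MULTILINE,
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    logger: str
    message: str


def first_error(log_text: str) -> Optional[LogEntry]:
    """Return the first ERROR entry in file order, or None if there is none."""
    for m in ENTRY.finditer(log_text):
        if m.group("level") == "ERROR":
            return LogEntry(m.group("ts"), m.group("level"),
                            m.group("logger"), m.group("msg"))
    return None


# Sample lines copied from the log above; paths abbreviated with "...".
sample = """\
15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting to kill Python Worker
15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0 (TID 995)
java.io.FileNotFoundException: ... (No such file or directory)
15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local spark dir: ...
"""

if __name__ == "__main__":
    print(first_error(sample))
```

Stack-trace lines and the exception line itself do not start with a timestamp, so they are skipped; only the entry header is reported. WARN entries such as the 18:36 Python worker warning are ignored as well, so they would still need to be checked by eye.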
