Re: FetchFailedException and MetadataFetchFailedException

Rok Roskar Fri, 22 May 2015 02:41:47 -0700

On the worker/container that fails, the "file not found" error is the first
error; the output below is from the YARN log. There were some Python worker
crashes for another job/stage earlier (see the warning at 18:36), but I
expect those to be unrelated to this file-not-found error.


==================================================================================
LogType:stderr
Log Upload Time:15-May-2015 18:50:05
LogLength:5706
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/filecache/89/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting
to kill Python Worker
15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0
(TID 995)
java.io.FileNotFoundException:
/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489/03/shuffle_4_319_0.data (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
        at
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
        at
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
        at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
        at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at
org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
        at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
        at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
        at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local
spark dir:
/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
java.io.IOException: Failed to delete:
/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:933)
        at
org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
        at
org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:162)
        at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at
org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
        at
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:156)
        at
org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1208)
        at org.apache.spark.SparkEnv.stop(SparkEnv.scala:88)
        at org.apache.spark.executor.Executor.stop(Executor.scala:146)
        at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:105)
        at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
        at
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
        at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at
org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:38)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
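
For anyone who wants to confirm which error comes first in a long container
log, here is a minimal sketch in Python. It is not part of Spark or YARN:
`first_entry` and `LOG_LINE` are hypothetical names, and the pattern assumes
the "yy/MM/dd HH:mm:ss LEVEL Component: message" layout seen in the log
above, with wrapped lines already joined.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed layout, taken from the log above:
#   "15/05/15 18:50:03 ERROR Executor: Exception in task ..."
LOG_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([A-Z]+) ([\w.$]+): (.*)$"
)


def first_entry(
    lines: Iterable[str], level: str = "ERROR"
) -> Optional[Tuple[str, str, str]]:
    """Return (timestamp, component, message) of the first entry at `level`.

    Lines that do not start with a timestamp (SLF4J banners, stack frames)
    are skipped. Returns None if no entry at that level is found.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group(2) == level:
            return match.group(1), match.group(3), match.group(4)
    return None


# A few lines from the log above, with the mail wrapping undone.
sample = [
    "SLF4J: Class path contains multiple SLF4J bindings.",
    "15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop "
    "library for your platform... using builtin-java classes where applicable",
    "15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: "
    "Attempting to kill Python Worker",
    "15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0 "
    "(TID 995)",
    "15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local "
    "spark dir:",
]

print(first_entry(sample))
print(first_entry(sample, level="WARN"))
```

To run it against a real log, the aggregated output of
`yarn logs -applicationId application_1426230650260_1047` could be saved to a
file and passed in as `open(path)`. Running it once per container would show
whether the FileNotFoundException really precedes every other ERROR entry.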

On Tue, May 19, 2015 at 3:38 AM, Imran Rashid <[email protected]> wrote:

> Hi,
>
> Can you take a look at the logs and see what the first error you are
> getting is?  It's possible that the file doesn't exist when that error is
> produced, but it shows up later -- I've seen similar things happen, but
> only after there have already been some errors.  But, if you see that in
> the very first error, then I'm not sure what the cause is.  It would be
> helpful for you to send the logs.
>
> Imran
>
> On Fri, May 15, 2015 at 10:07 AM, rok <[email protected]> wrote:
>
>> I am trying to sort a collection of key-value pairs (between several
>> hundred million and a few billion) and have recently been getting lots of
>> "FetchFailedException" errors that seem to originate when one of the
>> executors cannot find a temporary shuffle file on disk. E.g.:
>>
>> org.apache.spark.shuffle.FetchFailedException:
>>
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>> (No such file or directory)
>>
>> This file actually exists:
>>
>> > ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>
>> This error repeats on several executors and is followed by a number of
>>
>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>> location for shuffle 0
>>
>> This results in most tasks being lost and executors dying.
>>
>> There is plenty of space on all of the appropriate filesystems, so none of
>> the executors are running out of disk space. I am running this via YARN
>> on approximately 100 nodes with 2 cores per node. Any thoughts on what
>> might be causing these errors? Thanks!
>>
>
