I am trying to sort a collection of key/value pairs (between several hundred
million and a few billion) and have recently been getting lots of
"FetchFailedException" errors that seem to originate when one of the
executors cannot find a temporary shuffle file on disk. E.g.:

org.apache.spark.shuffle.FetchFailedException:
/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
(No such file or directory)
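
For context, the job itself is basically just a big sortByKey over a pair
RDD, roughly along the lines of the sketch below (the input path, key/value
types, and partition count here are made-up placeholders, not the actual
values from my job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("big-sort"))

// Placeholder input path and record format; the real job reads several
// hundred million to a few billion tab-separated (key, value) records.
val pairs: RDD[(Long, String)] = sc
  .textFile("hdfs:///data/input")
  .map { line =>
    val Array(k, v) = line.split("\t", 2)
    (k.toLong, v)
  }

// The sortByKey is what kicks off shuffle 0, where the fetch failures appear.
val sorted = pairs.sortByKey(ascending = true, numPartitions = 2000)
sorted.saveAsTextFile("hdfs:///data/sorted")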

The file named in the exception actually exists on that node:

ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index

-rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index

This error repeats on several executors and is followed by a number of 

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0

This results in most of the tasks being lost and the executors dying.

There is plenty of space on all of the relevant filesystems, so none of
the executors are running out of disk space. I am running this via YARN on
approximately 100 nodes with 2 cores per node. Any thoughts on what might be
causing these errors? Thanks!
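
For what it's worth, the submission looks roughly like this (the memory
setting, class name, and jar below are placeholders, not the real ones):

# Rough shape of the submission: ~100 executors with 2 cores each on YARN.
spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-cores 2 \
  --executor-memory 8g \
  --class com.example.SortJob \
  sort-job.jar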


