I am trying to sort a collection of key/value pairs (between several hundred million and a few billion) and have recently been getting lots of "FetchFailedException" errors that seem to occur when one of the executors can't find a temporary shuffle file on disk. E.g.:
org.apache.spark.shuffle.FetchFailedException: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index (No such file or directory)

This file actually exists:

> ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
-rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index

This error repeats on several executors and is followed by a number of:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

This results in most tasks being lost and executors dying. There is plenty of space on all of the appropriate filesystems, so none of the executors are running out of disk space.

I am running this via YARN on approximately 100 nodes with 2 cores per node. Any thoughts on what might be causing these errors?

Thanks!
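
P.S. In case it helps, the job is essentially just a large sort by key. Below is a simplified sketch of the kind of job I'm running; the input path, parsing, and key/value types are placeholders and not my actual code:

import org.apache.spark.{SparkConf, SparkContext}

object SortJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-kv-pairs"))

    // Read (key, value) pairs; the path and tab-separated parsing are placeholders.
    val pairs = sc.textFile("hdfs:///data/input")
      .map { line =>
        val Array(k, v) = line.split("\t", 2)
        (k, v)
      }

    // sortByKey is the shuffle stage where the fetch failures show up.
    pairs.sortByKey().saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}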