Hi,

Could you write some simple code to reproduce the issue? I'm not sure why the shuffle data in the temp dir is being wrongly deleted.
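As a starting point, a job along these lines should at least exercise the BypassMergeSortShuffleWriter path in your stack trace (a minimal sketch, not a confirmed reproducer; the partition counts and data size are arbitrary assumptions):

// Shuffle-heavy sketch. Assumes the default
// spark.shuffle.sort.bypassMergeThreshold (200), so using fewer reduce
// partitions keeps the job on the bypass writer seen in the stack trace.
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-repro")
    val sc = new SparkContext(conf)
    try {
      // groupByKey does no map-side combine, which is one precondition
      // for the bypass merge-sort shuffle writer.
      sc.parallelize(0L until 10000000L, 400)
        .map(i => (i % 100L, i))
        .groupByKey(100)
        .count()
    } finally {
      sc.stop()
    }
  }
}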
thanks,

On Fri, Feb 26, 2016 at 6:00 AM, Zee Chen <zeo...@gmail.com> wrote:
> Hi,
>
> I am debugging a situation where SortShuffleWriter sometimes fails to
> create a file, with the following stack trace:
>
> 16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage
> 47827.0 (TID 1367089)
> java.io.FileNotFoundException:
> /tmp/spark-9dd8dca9-6803-4c6c-bb6a-0e9c0111837c/executor-129dfdb8-9422-4668-989e-e789703526ad/blockmgr-dda6e340-7859-468f-b493-04e4162d341a/00/temp_shuffle_69fe1673-9ff2-462b-92b8-683d04669aad
> (No such file or directory)
>         at java.io.FileOutputStream.open0(Native Method)
>         at java.io.FileOutputStream.open(FileOutputStream.java:270)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
>         at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> I checked the Linux file system (ext4) and saw that the /00/ subdir was
> missing. I went through a heap dump of the CoarseGrainedExecutorBackend
> JVM process and found that DiskBlockManager's subDirs list had more
> non-null 2-hex subdirs than were present on the file system! As a test
> I created all 64 2-hex subdirs by hand, and the problem went away.
>
> So, has anybody else seen this problem? Looking at the relevant logic
> in DiskBlockManager, it hasn't changed much since the fix for
> https://issues.apache.org/jira/browse/SPARK-6468.
>
> My configuration:
> spark-1.5.1, hadoop-2.6.0, standalone, oracle jdk8u60
>
> Thanks,
> Zee

--
---
Takeshi Yamamuro
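One more note for context: the heap-dump observation above (non-null entries in subDirs with no matching directory on disk) lines up with how the 2-hex subdirs are managed. Below is a rough, simplified sketch approximating the directory-selection logic in DiskBlockManager.getFile around Spark 1.5 (class name and hashing are abbreviated assumptions; the real code also spreads files across multiple local dirs and throws on mkdir failure):

import java.io.File

// Simplified sketch: the 2-hex subdir (e.g. /00/) is created lazily on
// first use and then cached in the in-memory subDirs array. Later lookups
// trust the cached File and never re-check the disk.
class SubDirSketch(localDir: File, subDirsPerLocalDir: Int = 64) {
  private val subDirs = new Array[File](subDirsPerLocalDir)

  def getFile(filename: String): File = {
    val hash = filename.hashCode & Integer.MAX_VALUE // non-negative hash
    val subDirId = hash % subDirsPerLocalDir
    val subDir = subDirs.synchronized {
      Option(subDirs(subDirId)).getOrElse {
        val newDir = new File(localDir, "%02x".format(subDirId))
        newDir.mkdirs() // created once; error handling trimmed for brevity
        subDirs(subDirId) = newDir
        newDir
      }
    }
    new File(subDir, filename) // no re-check that subDir still exists
  }
}

If a cached entry goes stale, for instance because something outside Spark (a periodic /tmp cleaner is one guess, not confirmed) removes the subdirectory while the executor is running, the next DiskBlockObjectWriter.open under that path would fail exactly as in your stack trace. That would also explain why pre-creating all 64 subdirs by hand makes the problem go away.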