This usually happens on Linux when an application deletes a file without first making sure there are no open file descriptors (FDs) on it (a resource leak). In that case Linux keeps the space allocated and does not release it until the application exits (crashes, in your case). You then check the file system and everything looks normal: there is plenty of free space, yet the application keeps reporting "no space left on device".
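You can see the effect in isolation with something like this (rough sketch):

  $ dd if=/dev/zero of=/vol0/big bs=1M count=1024   # write a 1 GB file
  $ tail -f /vol0/big &                             # keep an open FD on it
  $ rm /vol0/big                                    # unlink it
  $ df -h /vol0                                     # the 1 GB stays "used" until tail exits

To check whether this is what is happening on your workers, compare du with df and look for unlinked files that are still open (exact lsof output varies by distro):

  $ du -sh /vol0 ; df -h /vol0   # df reporting far more used than du can find points to deleted-but-open files
  $ lsof +L1                     # open files with link count 0, i.e. deleted but still held open
  $ lsof | grep '(deleted)'      # Linux appends "(deleted)" to the NAME of such files

If lsof shows large "(deleted)" entries owned by the executor JVMs, that is where the space went.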
Just a guess.

-Vladimir Rodionov

On Tue, Feb 24, 2015 at 8:34 AM, Joe Wass <jw...@crossref.org> wrote:

> I'm running a cluster of 3 Amazon EC2 machines (small number because it's
> expensive when experiments keep crashing after a day!).
>
> Today's crash looks like this (stacktrace at end of message).
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle 0
>
> On my three nodes, I have plenty of space and inodes:
>
> A $ df -i
> Filesystem      Inodes  IUsed      IFree IUse% Mounted on
> /dev/xvda1      524288  97937     426351   19% /
> tmpfs          1909200      1    1909199    1% /dev/shm
> /dev/xvdb      2457600     54    2457546    1% /mnt
> /dev/xvdc      2457600     24    2457576    1% /mnt2
> /dev/xvds    831869296  23844  831845452    1% /vol0
>
> A $ df -h
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/xvda1   7.9G  3.4G  4.5G  44% /
> tmpfs        7.3G     0  7.3G   0% /dev/shm
> /dev/xvdb     37G  1.2G   34G   4% /mnt
> /dev/xvdc     37G  177M   35G   1% /mnt2
> /dev/xvds   1000G  802G  199G  81% /vol0
>
> B $ df -i
> Filesystem      Inodes  IUsed      IFree IUse% Mounted on
> /dev/xvda1      524288  97947     426341   19% /
> tmpfs          1906639      1    1906638    1% /dev/shm
> /dev/xvdb      2457600     54    2457546    1% /mnt
> /dev/xvdc      2457600     24    2457576    1% /mnt2
> /dev/xvds    816200704  24223  816176481    1% /vol0
>
> B $ df -h
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/xvda1   7.9G  3.6G  4.3G  46% /
> tmpfs        7.3G     0  7.3G   0% /dev/shm
> /dev/xvdb     37G  1.2G   34G   4% /mnt
> /dev/xvdc     37G  177M   35G   1% /mnt2
> /dev/xvds   1000G  805G  195G  81% /vol0
>
> C $ df -i
> Filesystem      Inodes  IUsed      IFree IUse% Mounted on
> /dev/xvda1      524288  97938     426350   19% /
> tmpfs          1906897      1    1906896    1% /dev/shm
> /dev/xvdb      2457600     54    2457546    1% /mnt
> /dev/xvdc      2457600     24    2457576    1% /mnt2
> /dev/xvds    755218352  24024  755194328    1% /vol0
>
> C $ df -h
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/xvda1   7.9G  3.4G  4.5G  44% /
> tmpfs        7.3G     0  7.3G   0% /dev/shm
> /dev/xvdb     37G  1.2G   34G   4% /mnt
> /dev/xvdc     37G  177M   35G   1% /mnt2
> /dev/xvds   1000G  820G  181G  82% /vol0
>
> The devices may be ~80% full but that still leaves ~200G free on each. My
> spark-env.sh has
>
> export SPARK_LOCAL_DIRS="/vol0/spark"
>
> I have manually verified that on each slave the only temporary files are
> stored on /vol0, all looking something like this
>
> /vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884
>
> So it looks like all the files are being stored on the large drives
> (incidentally they're AWS EBS volumes, but that's the only way to get
> enough storage). My process crashed before with a slightly different
> exception under the same circumstances: kryo.KryoException:
> java.io.IOException: No space left on device
>
> These both happen after several hours and several GB of temporary files.
>
> Why does Spark think it's run out of space?
>
> TIA
>
> Joe
>
> Stack trace 1:
>
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
>     at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
>     at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>     at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
>     at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
>     at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
>     at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
>     at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
>     at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:109)
>     at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1177)
>     at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:78)
>     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
>     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>     at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> Stack trace 2:
>
> 15/02/22 02:47:08 WARN scheduler.TaskSetManager: Lost task 282.0 in stage 25.1 (TID 22644): com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device
>     at com.esotericsoftware.kryo.io.Output.flush(Output.java:157)
>     at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
>     at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
>     at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
>     at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:153)
>     at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
>     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>     at carbonite.serializer$print_collection.invoke(serializer.clj:41)
>     at clojure.lang.Var.invoke(Var.java:423)
>     at carbonite.ClojureCollSerializer.write(ClojureCollSerializer.java:19)
>     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>     at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:130)
>     at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
>     at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:303)
>     at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:254)
>     at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:83)
>     at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:87)
>     at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:83)
>     at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:237)
>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:206)
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: No space left on device
>     at java.io.FileOutputStream.writeBytes(Native Method)
>     at java.io.FileOutputStream.write(FileOutputStream.java:345)
>     at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:86)
>     at org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:221)
>     at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:86)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>     at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:300)
>     at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:247)
>     at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)