I'm running a cluster of three Amazon EC2 machines (a small number, because it gets expensive when experiments keep crashing after a day!).

Today's crash looks like this (full stack traces at the end of this message):

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

On my three nodes, I have plenty of space and inodes:

A $ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/xvda1            524288   97937  426351   19% /
tmpfs                1909200       1 1909199    1% /dev/shm
/dev/xvdb            2457600      54 2457546    1% /mnt
/dev/xvdc            2457600      24 2457576    1% /mnt2
/dev/xvds            831869296   23844 831845452    1% /vol0

A $ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  3.4G  4.5G  44% /
tmpfs                 7.3G     0  7.3G   0% /dev/shm
/dev/xvdb              37G  1.2G   34G   4% /mnt
/dev/xvdc              37G  177M   35G   1% /mnt2
/dev/xvds            1000G  802G  199G  81% /vol0

B $ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/xvda1            524288   97947  426341   19% /
tmpfs                1906639       1 1906638    1% /dev/shm
/dev/xvdb            2457600      54 2457546    1% /mnt
/dev/xvdc            2457600      24 2457576    1% /mnt2
/dev/xvds            816200704   24223 816176481    1% /vol0

B $ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  3.6G  4.3G  46% /
tmpfs                 7.3G     0  7.3G   0% /dev/shm
/dev/xvdb              37G  1.2G   34G   4% /mnt
/dev/xvdc              37G  177M   35G   1% /mnt2
/dev/xvds            1000G  805G  195G  81% /vol0

C $ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/xvda1            524288   97938  426350   19% /
tmpfs                1906897       1 1906896    1% /dev/shm
/dev/xvdb            2457600      54 2457546    1% /mnt
/dev/xvdc            2457600      24 2457576    1% /mnt2
/dev/xvds            755218352   24024 755194328    1% /vol0

C $ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  3.4G  4.5G  44% /
tmpfs                 7.3G     0  7.3G   0% /dev/shm
/dev/xvdb              37G  1.2G   34G   4% /mnt
/dev/xvdc              37G  177M   35G   1% /mnt2
/dev/xvds            1000G  820G  181G  82% /vol0

The devices may be ~80% full, but that still leaves 180-200G free on each. My spark-env.sh contains:

export SPARK_LOCAL_DIRS="/vol0/spark"
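
I haven't tried it yet, but since SPARK_LOCAL_DIRS accepts a comma-separated list of directories, one experiment would be to spread the scratch space over the instance stores as well (just a sketch, not something I've tested):

# untested idea: also use the instance-store volumes for Spark scratch space
export SPARK_LOCAL_DIRS="/vol0/spark,/mnt/spark,/mnt2/spark"

That wouldn't explain the error, though, since /vol0 on its own should have plenty of room.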

I have manually verified that, on each slave, the only temporary files are stored on /vol0, and they all have paths like this:

/vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884
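
To be extra sure I haven't missed anything, on the next run I'll also sweep the other filesystems for stray Spark scratch directories, with something along these lines (run on each slave):

# look for Spark scratch directories on the root and instance-store filesystems (not /vol0)
sudo find / /mnt /mnt2 -xdev -name 'spark-*' 2>/dev/null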

So it looks like all the temporary files really are going to the large drives (incidentally, they're AWS EBS volumes, but that's the only way to get enough storage). The process has crashed before under the same circumstances with a slightly different exception:

kryo.KryoException: java.io.IOException: No space left on device

Both crashes happen after several hours of running, once several GB of temporary files have accumulated.
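
In the meantime I'll leave a crude disk-usage logger running on each slave, so I can see which filesystem (if any) actually fills up at the moment of the crash:

# log per-mount usage once a minute on each slave
while true; do date; df -h / /mnt /mnt2 /vol0; sleep 60; done >> ~/df.log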

Why does Spark think it's run out of space?

TIA

Joe

Stack trace 1:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
        at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:109)
        at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1177)
        at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:78)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
        at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Stack trace 2:

15/02/22 02:47:08 WARN scheduler.TaskSetManager: Lost task 282.0 in stage 25.1 (TID 22644): com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device
        at com.esotericsoftware.kryo.io.Output.flush(Output.java:157)
        at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
        at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
        at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
        at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:153)
        at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        at carbonite.serializer$print_collection.invoke(serializer.clj:41)
        at clojure.lang.Var.invoke(Var.java:423)
        at carbonite.ClojureCollSerializer.write(ClojureCollSerializer.java:19)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:130)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
        at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:303)
        at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:254)
        at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:83)
        at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:87)
        at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:83)
        at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:237)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:206)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:86)
        at org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:221)
        at org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:86)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:300)
        at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:247)
        at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
