Root partitions on AWS instances tend to be small (for example, an m1.large instance has two 420 GB instance-store drives, but only a 10 GB root partition). Matei is probably right about this; you just need to be careful where things like the logs get stored.
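If it does turn out to be Spark's own scratch space or the JVM temp dir landing on /, one option is to point both at the big ephemeral mounts from the application config. A rough sketch (untested; the /mnt and /mnt2 paths assume the standard spark-ec2 layout, and the config keys assume Spark 1.0+):

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep shuffle/spill scratch space off the small root volume. The paths are
    // placeholders for whatever large disks the instances actually have, and the
    // directories need to exist on every node.
    val conf = new SparkConf()
      .setAppName("s3-image-pipeline")  // hypothetical app name
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
      // Executor JVMs pick up their temp dir from here; the driver's own
      // java.io.tmpdir has to be set when its JVM is launched (e.g. in
      // spark-defaults.conf or on the submit command line), since setting it
      // after startup has no effect.
      .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/mnt/tmp")

    val sc = new SparkContext(conf)

That said, the shuffle files reported under /mnt/spark and /mnt2/spark below suggest spark.local.dir is already pointed at the ephemeral drives, which makes the logs or the S3 library's temp files the more likely culprits, as Matei says.
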
From: Matei Zaharia <matei.zaha...@gmail.com>
Date: Saturday, August 9, 2014 at 1:48 PM
To: "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>, kmatzen <kmat...@gmail.com>
Subject: Re: No space left on device

Your map-only job should not be shuffling, but if you want to see what's running, look at the web UI at http://<driver>:4040. In fact the job should not even write anything to disk, except insofar as the Hadoop S3 library might build up blocks locally before sending them on.

My guess is that it's not /mnt or /mnt2 that are getting filled but the root volume, /, either with logs or with temp files created by the Hadoop S3 library. You can check this by running "df" while the job is executing. (Tools like Ganglia can probably also log this.) If it is the logs, you can symlink the spark/logs directory to somewhere on /mnt instead. If it's /tmp, you can set java.io.tmpdir to another directory in Spark's JVM options.

Matei

On August 8, 2014 at 11:02:48 PM, kmatzen (kmat...@gmail.com) wrote:

I need some configuration / debugging recommendations to work around "no space left on device". I am completely new to Spark, but I have some experience with Hadoop.

I have a task where I read images stored in sequence files from s3://, process them with a map in Scala, and write the result back to s3://. I have about 15 r3.8xlarge instances allocated with the included EC2 script. The input data is about 1.5 TB and I expect the output to be similarly sized; 15 r3.8xlarge instances give me about 3 TB of RAM and 9 TB of storage, so hopefully more than enough for this task.

What happens is that it takes about an hour to read in the input from S3. Once that is complete, it begins to process the images and several succeed, but the job quickly fails with "no space left on device". By the time I can ssh into one of the machines that reported the error, the temp files have already been cleaned up, and I don't see any more detailed messages in the slave logs. I have not yet changed the logging configuration from the default.

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and /mnt2/ephemeral-hdfs/s3 (I see mostly input files at the time of failure, but maybe one output file per slave). Shuffle files are generated in /mnt/spark/<something> and /mnt2/spark/<something> (they were cleaned up once the job failed and I don't remember the exact directory I saw while it was still running). I checked the disk utilization for a few slaves while the pipeline was running and they were pretty far from full, but the failure probably came from a slave that was overloaded by a shard imbalance (though why would that happen on read -> map -> write?).

What other things might I need to configure to prevent this error? What logging options do people recommend? Is there an easy way to diagnose Spark failures from the web interface, as there is with Hadoop? I need to do some more testing to make sure I'm not emitting a giant image for a malformed input image, but I figured I'd post this question early in case anyone had recommendations.

BTW, why does a map-only job need to shuffle? I was expecting it to pipeline the transfer in from S3, the actual computation, and the transfer back out to S3, rather than doing everything serially with a giant disk footprint.
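For reference, the job is essentially just the following (the bucket names, the key/value types, and the processImage helper are placeholders rather than the real code; sc is the SparkContext):

    import org.apache.spark.SparkContext._   // implicit conversions needed for saveAsSequenceFile
    import org.apache.hadoop.io.{BytesWritable, Text}

    // Read image records from sequence files on S3, transform each one,
    // and write sequence files back out to S3.
    val input = sc.sequenceFile("s3://some-input-bucket/images/",
                                classOf[Text], classOf[BytesWritable])

    val output = input.map { case (key, value) =>
      // Hadoop reuses Writable instances, so copy the bytes before touching them.
      val bytes = java.util.Arrays.copyOf(value.getBytes, value.getLength)
      val processed: Array[Byte] = processImage(bytes)  // placeholder for the real image transform
      (new Text(key.toString), new BytesWritable(processed))
    }

    output.saveAsSequenceFile("s3://some-output-bucket/processed/")
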
Actually, I was thinking it would fuse all three operations into a single stage. Is that not what Spark does?
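One way to check, I suppose, is to print the lineage of the result before saving it (using the output RDD from the sketch above) and confirm that nothing shuffle-related shows up; the web UI at http://<driver>:4040 should likewise show a single stage for the save:

    // If read -> map -> write really runs as one stage, the lineage should show
    // only the Hadoop input RDD and map transformations, with no ShuffledRDD.
    println(output.toDebugString)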