I need some configuration / debugging recommendations to work around "no
space left on device".  I am completely new to Spark, but I have some
experience with Hadoop.

I have a task where I read images stored in sequence files from s3://,
process them with a map in scala, and write the result back to s3://.  I
have about 15 r3.8xlarge instances allocated with the included spark-ec2
script.  The input data is about 1.5 TB and I expect the output to be
similarly sized.  15 r3.8xlarge instances give me about 3 TB of RAM and 9 TB
of ephemeral storage, so hopefully more than enough for this task.

What happens is that it takes about an hour to read the input from S3.
Once that is complete, it begins to process the images, and several tasks
succeed.  However, the job quickly fails with "no space left on device".
By the time I can SSH into one of the machines that reported the error, the
temp files have already been cleaned up.  I don't see any more detailed
messages in the slave logs.  I have not yet changed the logging
configuration from the defaults.

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and
/mnt2/ephemeral-hdfs/s3 (at the time of failure I see mostly input files,
but maybe one output file per slave).  Shuffle files are generated in
/mnt/spark/<something> and /mnt2/spark/<something> (they were cleaned up
once the job failed, and I don't remember the exact directory I saw while
the job was still running).  I checked the disk utilization on a few slaves
while the pipeline was running and they were pretty far from full.  But the
failure probably came from a slave whose disks filled because of a
partition imbalance (though why would that happen on a read -> map -> write
job?).
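For what it's worth, this is roughly how I was spot-checking utilization on a slave.  The /mnt and /mnt2 paths are the spark-ec2 defaults; the fallback to a plain `df -h` is just so the snippet runs on machines with a different layout:

```shell
# Spot-check the ephemeral volumes on a slave; fall back to all
# mounts if /mnt and /mnt2 don't exist on this machine.
df -h /mnt /mnt2 2>/dev/null || df -h

# Shuffle/temp files land under these directories while a job runs
# (they are deleted once the job fails, so check while it's alive):
du -sh /mnt/spark /mnt2/spark 2>/dev/null || true
```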

What else might I need to configure to prevent this error?  What logging
options do people recommend?  Is there an easy way to diagnose Spark
failures from the web interface, like there is with Hadoop?
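In case my scratch-directory configuration is the culprit: I believe the spark-ec2 setup points Spark's scratch space at the ephemeral disks roughly like this (paraphrased from the generated config, so treat the exact values as an assumption on my part):

```
# conf/spark-defaults.conf -- comma-separated list of scratch
# directories used for shuffle output and spilled data:
spark.local.dir    /mnt/spark,/mnt2/spark

# Equivalent environment form in conf/spark-env.sh:
# export SPARK_LOCAL_DIRS=/mnt/spark,/mnt2/spark
```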

I need to do some more testing to make sure I'm not emitting a giant output
image for some malformed input image, but I figured I'd post this question
early in case anyone had any recommendations.

BTW, why does a map-only job need to shuffle at all?  I was expecting it to
pipeline the transfer in from S3, the actual computation, and the transfer
back out to S3, rather than doing everything serially with a giant disk
footprint.  Actually, I was thinking it would fuse all three operations
into a single stage.  Is that not what Spark does?
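For concreteness, the job is shaped roughly like this (processImage and the bucket names are placeholders, not my actual code).  Since map is the only transformation, I expected a single stage with no shuffle:

```scala
import org.apache.hadoop.io.{BytesWritable, Text}

// read -> map -> write; there are no wide dependencies anywhere,
// so I expected Spark to pipeline this as one stage rather than
// materialize intermediate data on disk.
val images = sc.sequenceFile("s3n://input-bucket/images/",
                             classOf[Text], classOf[BytesWritable])
val processed = images.map { case (key, bytes) =>
  (key, processImage(bytes))  // processImage: BytesWritable => BytesWritable
}
processed.saveAsSequenceFile("s3n://output-bucket/images/")
```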





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-space-left-on-device-tp11829.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
