Root partitions on AWS instances tend to be small (for example, an m1.large instance has two 420 GB instance-store drives, but only a 10 GB root partition). Matei is probably right about this; you just need to be careful where things like the logs get stored.
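If it does turn out to be Spark's own scratch space or the JVM temp dir landing on /, one option is to point both at the big ephemeral mounts from the application config. A rough sketch (untested; the /mnt and /mnt2 paths assume the standard spark-ec2 layout, and the config keys assume Spark 1.0+):

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep shuffle/spill scratch space off the small root volume. The paths are
    // placeholders for whatever large disks the instances actually have, and the
    // directories need to exist on every node.
    val conf = new SparkConf()
      .setAppName("s3-image-pipeline")  // hypothetical app name
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
      // Executor JVMs pick up their temp dir from here; the driver's own
      // java.io.tmpdir has to be set when its JVM is launched (e.g. in
      // spark-defaults.conf or on the submit command line), since setting it
      // after startup has no effect.
      .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/mnt/tmp")

    val sc = new SparkContext(conf)

That said, the shuffle files reported under /mnt/spark and /mnt2/spark below suggest spark.local.dir is already pointed at the ephemeral drives, which makes the logs or the S3 library's temp files the more likely culprits, as Matei says.
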
From: Matei Zaharia <matei.zaha...@gmail.com>
Date: Saturday, August 9, 2014 at 1:48 PM
To: "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>, kmatzen <kmat...@gmail.com>
Subject: Re: No space left on device

Your map-only job should not be shuffling, but if you want to see what's running, look at the web UI at http://<driver>:4040. In fact the job should not even write anything to disk, except insofar as the Hadoop S3 library might build up blocks locally before sending them on.

My guess is that it's not /mnt or /mnt2 that are getting filled but the root volume, /, either with logs or with temp files created by the Hadoop S3 library. You can check this by running "df" while the job is executing. (Tools like Ganglia can probably also log this.) If it is the logs, you can symlink the spark/logs directory to somewhere on /mnt instead. If it's /tmp, you can set java.io.tmpdir to another directory in Spark's JVM options.

Matei

On August 8, 2014 at 11:02:48 PM, kmatzen (kmat...@gmail.com) wrote:

I need some configuration / debugging recommendations to work around "no space left on device". I am completely new to Spark, but I have some experience with Hadoop.

I have a task where I read images stored in sequence files from s3://, process them with a map in Scala, and write the result back to s3://. I have about 15 r3.8xlarge instances allocated with the included EC2 script. The input data is about 1.5 TB and I expect the output to be similarly sized; 15 r3.8xlarge instances give me about 3 TB of RAM and 9 TB of storage, so hopefully more than enough for this task.

What happens is that it takes about an hour to read in the input from S3. Once that is complete, it begins to process the images and several succeed, but the job quickly fails with "no space left on device". By the time I can ssh into one of the machines that reported the error, the temp files have already been cleaned up, and I don't see any more detailed messages in the slave logs. I have not yet changed the logging configuration from the default.

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and /mnt2/ephemeral-hdfs/s3 (I see mostly input files at the time of failure, but maybe one output file per slave). Shuffle files are generated in /mnt/spark/<something> and /mnt2/spark/<something> (they were cleaned up once the job failed and I don't remember the exact directory I saw while it was still running). I checked the disk utilization for a few slaves while the pipeline was running and they were pretty far from full, but the failure probably came from a slave that was overloaded by a shard imbalance (though why would that happen on read -> map -> write?).

What other things might I need to configure to prevent this error? What logging options do people recommend? Is there an easy way to diagnose Spark failures from the web interface, as there is with Hadoop? I need to do some more testing to make sure I'm not emitting a giant image for a malformed input image, but I figured I'd post this question early in case anyone had recommendations.

BTW, why does a map-only job need to shuffle? I was expecting it to pipeline the transfer in from S3, the actual computation, and the transfer back out to S3, rather than doing everything serially with a giant disk footprint.
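For reference, the job is essentially just the following (the bucket names, the key/value types, and the processImage helper are placeholders rather than the real code; sc is the SparkContext):

    import org.apache.spark.SparkContext._   // implicit conversions needed for saveAsSequenceFile
    import org.apache.hadoop.io.{BytesWritable, Text}

    // Read image records from sequence files on S3, transform each one,
    // and write sequence files back out to S3.
    val input = sc.sequenceFile("s3://some-input-bucket/images/",
                                classOf[Text], classOf[BytesWritable])

    val output = input.map { case (key, value) =>
      // Hadoop reuses Writable instances, so copy the bytes before touching them.
      val bytes = java.util.Arrays.copyOf(value.getBytes, value.getLength)
      val processed: Array[Byte] = processImage(bytes)  // placeholder for the real image transform
      (new Text(key.toString), new BytesWritable(processed))
    }

    output.saveAsSequenceFile("s3://some-output-bucket/processed/")
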
Actually, I was thinking it would fuse all three operations into a single stage. Is that not what Spark does?
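One way to check, I suppose, is to print the lineage of the result before saving it (using the output RDD from the sketch above) and confirm that nothing shuffle-related shows up; the web UI at http://<driver>:4040 should likewise show a single stage for the save:

    // If read -> map -> write really runs as one stage, the lineage should show
    // only the Hadoop input RDD and map transformations, with no ShuffledRDD.
    println(output.toDebugString)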