I need some configuration / debugging recommendations to work around "no space left on device". I am completely new to Spark, but I have some experience with Hadoop.
I have a task where I read images stored in sequence files from s3://, process them with a map in Scala, and write the results back to s3://. I have about 15 r3.8xlarge instances allocated with the included EC2 script. The input data is about 1.5 TB and I expect the output to be similarly sized, so 15 r3.8xlarge instances (roughly 3 TB of RAM and 9 TB of ephemeral storage) should hopefully be more than enough for this task.

What happens is that it takes about an hour to read the input from S3. Once that completes, the job begins to process the images and several succeed, but it quickly fails with "no space left on device". By the time I can ssh into one of the machines that reported the error, the temp files have already been cleaned up, and I don't see any more detailed messages in the slave logs. I have not yet changed the logging configuration from the default.

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and /mnt2/ephemeral-hdfs/s3 (at the time of failure I see mostly input files there, plus maybe one output file per slave). Shuffle files are generated in /mnt/spark/<something> and /mnt2/spark/<something> (they were cleaned up once the job failed, and I don't remember the exact directory name I saw while it was still running). I checked disk utilization on a few slaves while the pipeline was running and they were nowhere near full, but the failure probably came from a slave that was overloaded by a partition imbalance (although why would that happen for a read -> map -> write job?).

What else might I need to configure to prevent this error? What logging options do people recommend? Is there an easy way to diagnose Spark failures from the web interface, as with Hadoop? I still need to do more testing to make sure I'm not emitting a giant image for some malformed input image, but I figured I'd post this question early in case anyone has recommendations.

By the way, why does a map-only job need to shuffle at all? I expected it to pipeline the transfer in from S3, the actual computation, and the transfer back out to S3, rather than doing each step serially with a giant disk footprint. In fact, I thought all three operations would be fused into a single stage. Is that not what Spark does?
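For reference, the job is roughly the following (a simplified sketch; the bucket paths, the key/value types, and processImage are placeholders for my real code):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // implicits for saveAsSequenceFile on older Spark versions

object ImagePipeline {
  // Placeholder: the real version decodes the image bytes, transforms them, and re-encodes them.
  def processImage(bytes: BytesWritable): BytesWritable = bytes

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("image-pipeline"))

    // Read (key, image bytes) pairs from sequence files on S3.
    val images = sc.sequenceFile("s3://input-bucket/images", classOf[Text], classOf[BytesWritable])

    // Map-only transformation over each image (copy the key since Hadoop reuses Writable objects).
    val processed = images.map { case (key, bytes) => (new Text(key), processImage(bytes)) }

    // Write the results back to S3 as sequence files.
    processed.saveAsSequenceFile("s3://output-bucket/images")

    sc.stop()
  }
}

As far as I can tell, nothing in there should require a shuffle.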