Re: Spark and disk usage.

Burak Yavuz Wed, 17 Sep 2014 12:09:22 -0700

Yes, writing to HDFS is more expensive, but I feel it is still a small price to 
pay when compared to having a Disk Space Full error three hours in
and having to start from scratch.


The main goal of checkpointing is to truncate the lineage. Clearing up shuffle 
writes comes as a bonus; it is not the main goal. The subtlety here is that 
.checkpoint() is lazy, just like .cache(): until you call an action, nothing 
happens. Therefore, if you do 1000 maps in a row and wait for the next shuffle 
instead of checkpointing (and running an action) along the way, you will still 
get a StackOverflowError, because the lineage is too long.

I went through some of the code for checkpointing. As far as I can tell, it 
materializes the data in HDFS and clears the RDD's dependencies, so you start 
a fresh lineage. My understanding is that checkpointing should still be done 
every N operations to reset the lineage. Note, however, that an action must be 
performed before the lineage grows too long; otherwise the checkpoint never 
takes effect.
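
To make the laziness concrete, here is a toy model in plain Python. It is not 
Spark code: ToyRDD and run are hypothetical names, and the model assumes that 
only the node on which the action is called gets checkpointed. Python's 
RecursionError stands in for the JVM's StackOverflowError. It is only a sketch 
of the reasoning above, not of how Spark implements checkpointing.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a lazy node that remembers its parent."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent
        self.fn = fn
        self.data = data      # set only for source or checkpointed nodes
        self.marked = False   # checkpoint requested but not yet materialized

    def map(self, fn):
        # Lazy transformation: just extends the lineage by one node.
        return ToyRDD(parent=self, fn=fn)

    def checkpoint(self):
        # Lazy as well: only sets a flag, nothing is computed or saved.
        self.marked = True

    def _compute(self):
        # Recursive walk up the lineage; a very long lineage exhausts the stack.
        if self.data is not None:
            return self.data
        return [self.fn(x) for x in self.parent._compute()]

    def collect(self):
        # The "action": computes the data and materializes a pending checkpoint.
        result = self._compute()
        if self.marked:
            self.data = result   # stands in for writing to HDFS
            self.parent = None   # dependencies dropped: fresh lineage
            self.fn = None
            self.marked = False
        return result

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


def run(n_maps, checkpoint_every=None, act=True):
    """Chain n_maps maps, optionally checkpointing every N operations."""
    rdd = ToyRDD(data=[0, 1, 2])
    for i in range(1, n_maps + 1):
        rdd = rdd.map(lambda x: x + 1)
        if checkpoint_every and i % checkpoint_every == 0:
            rdd.checkpoint()
            if act:
                rdd.collect()  # without an action the checkpoint never happens
    return rdd


if __name__ == "__main__":
    # Marking without an action leaves the lineage at full length.
    print(run(5000, checkpoint_every=100, act=False).lineage_depth())
    # Checkpoint plus action every 100 maps keeps the lineage short.
    print(run(5000, checkpoint_every=100).lineage_depth())
```

In this model, run(5000) followed by collect() fails with a RecursionError, and 
so does the act=False variant, since marking alone never truncates anything. 
Only the variant that runs an action after each checkpoint() completes.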

I believe it would be nice to write up checkpointing in the programming guide. 
I think the reason it's not there yet is that most applications don't grow 
such a long lineage; the exceptions are Spark Streaming and some MLlib 
algorithms. If you can help with the guide, I think it would be a nice 
addition!

Burak


----- Original Message -----
From: "Andrew Ash" <and...@andrewash.com>
To: "Burak Yavuz" <bya...@stanford.edu>
Cc: "Макар Красноперов" <connector....@gmail.com>, "user" 
<user@spark.apache.org>
Sent: Wednesday, September 17, 2014 11:04:02 AM
Subject: Re: Spark and disk usage.

Thanks for the info!

Are there performance impacts with writing to HDFS instead of local disk?
I'm assuming that's why ALS checkpoints every third iteration instead of
every iteration.

Also I can imagine that checkpointing should be done every N shuffles
instead of every N operations (counting maps), since only the shuffle
leaves data on disk.  Do you have any suggestions on this?

We should write up some guidance on the use of checkpointing in the programming
guide <https://spark.apache.org/docs/latest/programming-guide.html>. I can
help with this.

Andrew


