Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in and having to start from scratch.
The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus; it is not the main goal. The subtlety here is that .checkpoint() is just like .cache(): until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a row and you don't want to checkpoint in the meantime until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long.

I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS and resets all of the RDD's dependencies, so you start a fresh lineage. My understanding is that checkpointing should still be done every N operations to reset the lineage. However, an action must be performed before the lineage grows too long.

I believe it would be nice to write up checkpointing in the programming guide. I believe the reason it's not there yet is that most applications don't grow such a long lineage, except in Spark Streaming and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have!

Burak

----- Original Message -----
From: "Andrew Ash" <and...@andrewash.com>
To: "Burak Yavuz" <bya...@stanford.edu>
Cc: "Макар Красноперов" <connector....@gmail.com>, "user" <user@spark.apache.org>
Sent: Wednesday, September 17, 2014 11:04:02 AM
Subject: Re: Spark and disk usage.

Thanks for the info!

Are there performance impacts with writing to HDFS instead of local disk? I'm assuming that's why ALS checkpoints every third iteration instead of every iteration.

Also, I can imagine that checkpointing should be done every N shuffles instead of every N operations (counting maps), since only the shuffle leaves data on disk. Do you have any suggestions on this?

We should write up some guidance on the use of checkpointing in the programming guide <https://spark.apache.org/docs/latest/programming-guide.html>. I can help with this.

Andrew
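
To make the point in the reply about .checkpoint() being lazy a bit more concrete, here is a toy model in plain Python. It is not Spark code and not how Spark implements checkpointing: ToyRDD, lineage_depth, and run_chain are made-up names, rows are kept in memory instead of being written to HDFS, and Python's RecursionError stands in for the JVM's StackOverflowError. It only sketches the idea that marking an RDD for checkpointing does nothing until an action runs, and that the action has to happen before the lineage gets too deep.

```python
class ToyRDD:
    """Toy stand-in for an RDD: either materialized rows, or a parent plus a function."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data        # materialized rows, if any
        self.parent = parent    # upstream ToyRDD (one link of the lineage)
        self.fn = fn            # per-element function applied to the parent's rows
        self.marked = False     # set by checkpoint(), acted on by the next action

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyRDD(parent=self, fn=fn)

    def checkpoint(self):
        # Lazy, like cache(): this only records the request.
        self.marked = True

    def lineage_depth(self):
        # Number of parent links between this node and materialized data.
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

    def collect(self):
        # The "action". Recursive on purpose: each lineage step costs stack frames.
        if self.data is not None:
            return self.data
        rows = [self.fn(x) for x in self.parent.collect()]
        if self.marked:
            # Materialize and drop the dependencies, so a fresh lineage starts here.
            self.data, self.parent, self.fn = rows, None, None
        return rows


def run_chain(n_maps, every, run_action):
    """Apply n_maps maps, marking a checkpoint every `every` steps (hypothetical helper)."""
    rdd = ToyRDD(data=[0, 1, 2])
    for i in range(1, n_maps + 1):
        rdd = rdd.map(lambda x: x + 1)
        if i % every == 0:
            rdd.checkpoint()
            if run_action:
                rdd.collect()   # without this, the mark above has no effect yet
    return rdd


if __name__ == "__main__":
    ok = run_chain(3050, every=100, run_action=True)
    print("depth with actions:", ok.lineage_depth())

    lazy = run_chain(3050, every=100, run_action=False)
    print("depth without actions:", lazy.lineage_depth())
    try:
        lazy.collect()
    except RecursionError:
        print("lineage too long: the first action overflowed the stack")
```

In the first run the lineage never grows past the checkpoint interval, because every mark is followed by an action. In the second run the marks pile up unused, so the first action has to walk the whole chain at once, which is the situation described above for 1000 maps in a row.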