Hi Andrew,

Yes, I'm referring to sparkContext.setCheckpointDir(). It has the same effect
as in Spark Streaming.
For example, in an algorithm like ALS, the RDDs go through many
transformations, and the lineage of the RDD grows drastically, just like the
lineage of DStreams does in Spark Streaming. You may observe
StackOverflowErrors in ALS if you set the number of iterations very high.
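
To make that concrete, here is a minimal sketch (not ALS itself, just a
hypothetical iterative job) showing how the lineage grows with each pass:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x

    val sc = new SparkContext(new SparkConf().setAppName("lineage-growth"))

    // Each iteration adds a shuffle stage and a map to the RDD's lineage.
    var data = sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toDouble))
    for (_ <- 1 to 500) {
      data = data.reduceByKey(_ + _).mapValues(_ / 2.0)
    }
    // Walking such a deep lineage when the job is submitted can overflow
    // the driver's stack, which is the StackOverflowError you may see.
    data.count()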

If you set the checkpoint directory, however, the intermediate state of the
RDDs will be saved to HDFS, and the lineage will pick up from there.
You won't need to keep the shuffle data from before the checkpointed state,
so those files can be safely removed (and will be removed automatically).
However, checkpoint must be called explicitly, as in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
Just setting the directory is not enough.
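
As a minimal sketch (the HDFS path and checkpoint interval are just examples,
not anything ALS prescribes), the pattern in your own driver code would look
like:

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // example path

    var data = sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toDouble))
    for (i <- 1 to 500) {
      data = data.reduceByKey(_ + _).mapValues(_ / 2.0)
      if (i % 10 == 0) { // checkpoint every 10 iterations (arbitrary here)
        data.cache()       // avoid computing the RDD twice
        data.checkpoint()  // mark the RDD for checkpointing
        data.count()       // checkpointing happens on the next action
      }
    }

Note that checkpoint() only marks the RDD; the data is written to the
checkpoint directory the first time an action runs afterwards, and the
lineage is truncated from that point on.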

Best,
Burak

----- Original Message -----
From: "Andrew Ash" <and...@andrewash.com>
To: "Burak Yavuz" <bya...@stanford.edu>
Cc: "Макар Красноперов" <connector....@gmail.com>, "user" 
<user@spark.apache.org>
Sent: Wednesday, September 17, 2014 10:19:42 AM
Subject: Re: Spark and disk usage.

Hi Burak,

Most discussion of checkpointing in the docs is related to Spark
Streaming.  Are you talking about sparkContext.setCheckpointDir()?
What effect does that have?

https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

On Wed, Sep 17, 2014 at 7:44 AM, Burak Yavuz <bya...@stanford.edu> wrote:

> Hi,
>
> The files you mentioned are temporary files written by Spark during
> shuffling. ALS will write a LOT of those files, as it is a shuffle-heavy
> algorithm.
> Those files are kept until your program completes, because Spark looks for
> them in case a fault occurs. Having those files ready allows Spark to
> continue from the stage where the shuffle left off, instead of starting
> from the very beginning.
>
> Long story short, it's to your benefit that Spark writes those files to
> disk. If you don't want Spark writing to disk, you can specify a checkpoint
> directory in HDFS, where Spark will write the current state instead and
> will clean up the files from local disk.
>
> Best,
> Burak
>
> ----- Original Message -----
> From: "Макар Красноперов" <connector....@gmail.com>
> To: user@spark.apache.org
> Sent: Wednesday, September 17, 2014 7:37:49 AM
> Subject: Spark and disk usage.
>
> Hello everyone.
>
> The problem is that Spark writes data to disk very heavily, even when the
> application has a lot of free memory (about 3.8 GB).
> I've noticed that a folder with a name like
> "spark-local-20140917165839-f58c" contains a lot of other folders with
> files like "shuffle_446_0_1". The total size of the files in
> "spark-local-20140917165839-f58c" can reach 1.1 GB.
> Sometimes its size decreases (are there only temp files in that folder?),
> so the total amount of data written to disk is greater than 1.1 GB.
>
> The question is: what kind of data does Spark store there, and can I make
> Spark keep it in memory instead of writing it to disk when there is enough
> free RAM?
>
> I run my job locally with Spark 1.0.1:
> ./bin/spark-submit --driver-memory 12g --master local[3] --properties-file
> conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar
>
> spark-defaults.conf :
> spark.shuffle.spill             false
> spark.reducer.maxMbInFlight     1024
> spark.shuffle.file.buffer.kb    2048
> spark.storage.memoryFraction    0.7
>
> The situation with disk usage is common across many jobs. I have also used
> ALS from MLlib and seen similar behavior.
>
> I have had no success playing with the Spark configuration, and I hope
> someone can help me :)
>
>

