Re: Spark and disk usage.

2014-09-21 Thread Andrew Ash
From: "Andrew Ash" To: "Burak Yavuz" Cc: "Макар Красноперов", "user" <user@spark.apache.org> Sent: Wednesday, September 17, 2014 11:04:02 AM Subject: Re: Spark and disk usage. Thanks for the info! Ar

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
age, except in Spark Streaming and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have! Burak - Original Message - From: "Andrew Ash" To: "Burak Yavuz" Cc: "Макар Красноперов", "user" Sent: Wednesday

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Thanks for the info! Are there performance impacts with writing to HDFS instead of local disk? I'm assuming that's why ALS checkpoints every third iteration instead of every iteration. Also I can imagine that checkpointing should be done every N shuffles instead of every N operations (counting m

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
etting the directory will not be enough. Best, Burak - Original Message - From: "Andrew Ash" To: "Burak Yavuz" Cc: "Макар Красноперов", "user" Sent: Wednesday, September 17, 2014 10:19:42 AM Subject: Re: Spark and disk usage. Hi Burak, Most discussion

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Hi Burak, Most discussions of checkpointing in the docs are related to Spark Streaming. Are you talking about sparkContext.setCheckpointDir()? What effect does that have? https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing On Wed, Sep 17, 2014 at 7:44 AM, Bur

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi, The files you mentioned are temporary files written by Spark during shuffling. ALS will write a LOT of those files, as it is a shuffle-heavy algorithm. Those files will be deleted after your program completes; Spark keeps them until then because it looks for them in case a fault occurs. Having those files ready allo