From: "Andrew Ash"
> To: "Burak Yavuz"
> Cc: "Макар Красноперов" , "user" <
> user@spark.apache.org>
> Sent: Wednesday, September 17, 2014 11:04:02 AM
> Subject: Re: Spark and disk usage.
>
> Thanks for the info!
>
> Ar
[...] usage, except in Spark Streaming, and some MLlib algorithms.
If you can help with the guide, I think it would be a nice feature to have!
Burak
- Original Message -
From: "Andrew Ash"
To: "Burak Yavuz"
Cc: "Макар Красноперов", "user" <user@spark.apache.org>
Sent: Wednesday, September 17, 2014 11:04:02 AM
Subject: Re: Spark and disk usage.
Thanks for the info!
Are there performance impacts with writing to HDFS instead of local disk?
I'm assuming that's why ALS checkpoints every third iteration instead of
every iteration.
Also, I can imagine that checkpointing should be done every N shuffles
instead of every N operations (counting m[...])
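
As a concrete illustration of the "every N iterations" idea, here is a minimal sketch of an iterative RDD job that checkpoints every third pass. The checkpoint directory, interval, and workload are illustrative and are not taken from ALS's internals.

  import org.apache.spark.{SparkConf, SparkContext}

  object PeriodicCheckpointSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("periodic-checkpoint").setMaster("local[*]"))
      // Checkpoints should go to a reliable store such as HDFS; the path is illustrative.
      sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

      var current = sc.parallelize(1 to 1000000).map(_.toDouble)
      val checkpointInterval = 3 // mirrors the "every third iteration" behavior

      for (i <- 1 to 30) {
        current = current.map(_ * 1.01) // stand-in for one iteration of real work
        if (i % checkpointInterval == 0) {
          current.checkpoint() // mark the RDD; it is written out on the next action
          current.count()      // force materialization so the checkpoint happens now
        }
      }
      println(current.sum())
      sc.stop()
    }
  }
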
[...] setting the directory will not be enough.
Best,
Burak
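
To make that point concrete: setCheckpointDir only chooses the destination, and nothing is written until checkpoint() is called on an RDD and an action materializes it. A minimal sketch with an illustrative path:

  import org.apache.spark.{SparkConf, SparkContext}

  object CheckpointDirAloneIsNotEnough {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("checkpoint-basics").setMaster("local[*]"))
      sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // step 1: only sets the target

      val data = sc.parallelize(1 to 1000).map(_ * 2)
      data.checkpoint() // step 2: request a checkpoint for this RDD (still lazy)
      data.count()      // step 3: the first action actually writes the checkpoint files
      println(data.isCheckpointed) // true only after the action has run

      sc.stop()
    }
  }
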
- Original Message -
From: "Andrew Ash"
To: "Burak Yavuz"
Cc: "Макар Красноперов" , "user"
Sent: Wednesday, September 17, 2014 10:19:42 AM
Subject: Re: Spark and disk usage.
Hi Burak,
Most discussions of checkpointing in the docs are related to Spark
Streaming. Are you talking about the sparkContext.setCheckpointDir()?
What effect does that have?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
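
The guide linked above covers streaming-side checkpointing, which looks roughly like the sketch below; the socket source, batch interval, and directory are illustrative.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StreamingCheckpointSketch {
    private val checkpointDir = "hdfs:///tmp/streaming-checkpoints" // illustrative

    private def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("streaming-checkpoint").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // metadata and state checkpoints go here
      val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
      lines.print() // an output operation is required before start()
      ssc
    }

    def main(args: Array[String]): Unit = {
      // Recover from an existing checkpoint if present, otherwise build a new context.
      val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
      ssc.start()
      ssc.awaitTermination()
    }
  }
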
On Wed, Sep 17, 2014 at 7:44 AM, Burak Yavuz wrote:
Hi,
The files you mentioned are temporary files written by Spark during shuffling.
ALS will write a LOT of those files as it is a shuffle-heavy algorithm.
Those files will not be deleted until your program completes, because Spark looks for
those files in case a fault occurs. Having those files ready allows Spark to
recover from a fault without recomputing those shuffles.
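
On the disk-usage side, where those temporary shuffle files land is controlled by spark.local.dir (the cluster manager may override this; YARN, for example, uses its own local directories), which is separate from the checkpoint directory discussed above. A sketch with illustrative paths:

  import org.apache.spark.{SparkConf, SparkContext}

  object ShuffleScratchDirsSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("disk-usage")
        .setMaster("local[*]")
        // Scratch space for shuffle output and spills; several directories can be
        // listed comma-separated to spread the I/O across disks.
        .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")

      val sc = new SparkContext(conf)
      // Checkpoints, by contrast, go to the reliable location set here.
      sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

      // A shuffle-producing job whose intermediate files land under spark.local.dir.
      val counts = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.collect().foreach(println)

      sc.stop()
    }
  }
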