Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Thanks Silvio.

On Mon, Jun 29, 2015 at 7:41 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
> Regarding 1 and 2, yes shuffle output is stored on the worker local
> disks and will be reused across jobs as long as they’re available. You can
> identify when they’re used by seeing skipp

Re: Shuffle files lifecycle

2015-06-29 Thread Silvio Fiorito
Regarding 1 and 2: yes, shuffle output is stored on the workers' local disks and is reused across jobs for as long as the files are available. You can tell when it is reused because the job UI shows skipped stages. The files are periodically cleaned up based on available space of the configured spark.loca
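
To make the reuse behaviour described in this message concrete, here is a minimal toy model in plain Python. It is not Spark code and does not reflect Spark's actual scheduler: `ToyScheduler`, `run_job`, `clean`, and the log strings are all hypothetical names invented for illustration. It only sketches the idea that a later job can skip the map stage when the shuffle output from an earlier job is still on disk, and has to rerun it once that output has been cleaned up.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Record = Tuple[Hashable, int]


class ToyScheduler:
    """Toy model only: a dict stands in for shuffle files on local disk."""

    def __init__(self) -> None:
        # shuffle_id -> map output grouped by key (hypothetical stand-in
        # for the shuffle files written by the map stage)
        self.shuffle_store: Dict[int, Dict[Hashable, List[int]]] = {}
        self.log: List[str] = []

    def run_job(
        self,
        shuffle_id: int,
        records: Iterable[Record],
        reduce_fn: Callable[[List[int]], int],
    ) -> Dict[Hashable, int]:
        if shuffle_id in self.shuffle_store:
            # Shuffle output is still available: the map stage is skipped.
            self.log.append(f"shuffle {shuffle_id}: map stage skipped")
        else:
            # No shuffle output yet (or it was cleaned): run the map stage.
            buckets: Dict[Hashable, List[int]] = {}
            for key, value in records:
                buckets.setdefault(key, []).append(value)
            self.shuffle_store[shuffle_id] = buckets
            self.log.append(f"shuffle {shuffle_id}: map stage ran")
        # The reduce stage always runs, reading from the stored shuffle output.
        stored = self.shuffle_store[shuffle_id]
        return {key: reduce_fn(values) for key, values in stored.items()}

    def clean(self, shuffle_id: int) -> None:
        # Models cleanup of shuffle files; the next job must rerun the map stage.
        self.shuffle_store.pop(shuffle_id, None)


if __name__ == "__main__":
    scheduler = ToyScheduler()
    data = [("a", 1), ("b", 2), ("a", 3)]
    print(scheduler.run_job(0, data, sum))  # first job: map stage runs
    print(scheduler.run_job(0, data, max))  # second job: map stage skipped
    scheduler.clean(0)
    print(scheduler.run_job(0, data, sum))  # after cleanup: map stage runs again
    print(scheduler.log)
```

In real Spark the equivalent signal is the "skipped" label on stages in the job UI, as described above; the toy log is only an analogy for that.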

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint* does!
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

Thomas

On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber wrote:
> Hello,
>
> It is my understanding that shuffle are written on disk and that they act
> as chec