Thanks Silvio.
On Mon, Jun 29, 2015 at 7:41 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
> Regarding 1 and 2, yes shuffle output is stored on the worker local
> disks and will be reused across jobs as long as they’re available. You can
> identify when they’re used by seeing skipp
Regarding 1 and 2, yes shuffle output is stored on the worker local disks and
will be reused across jobs as long as they’re available. You can identify when
they’re used by seeing skipped stages in the job UI. They are periodically
cleaned up based on available space of the configured spark.loca
Ah, for #3, maybe this is what *rdd.checkpoint* does!
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Thomas
On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber wrote:
> Hello,
>
> It is my understanding that shuffles are written to disk and that they act
> as chec