Hello,

It is my understanding that shuffle files are written to disk and that they
act as checkpoints.

I wonder whether this is true only within a job, or also across jobs. Please
note that I am using the words job and stage in their precise Spark sense here.

1. Can shuffle files created during JobN be used to skip stages in JobN+1? Or
is the lifecycle of the shuffle files bound to the job that created them?
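
To make question 1 concrete, here is a minimal sketch of the pattern I have in
mind (standalone Scala; names and numbers are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleReuseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-reuse").setMaster("local[2]"))

    // One wide transformation -> one shuffle.
    val counts = sc.parallelize(1 to 1000000)
      .map(i => (i % 100, 1))
      .reduceByKey(_ + _)

    counts.count()   // Job 1: runs the shuffle map stage and writes shuffle files.
    counts.collect() // Job 2: the map stage shows up as "skipped" in the UI,
                     // i.e. the shuffle output written by Job 1 is reused.

    sc.stop()
  }
}

Seeing the map stage marked as skipped in the second job is what made me wonder
how far that reuse extends, and for how long the files stay around.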

2. When are shuffle files actually deleted? Is it TTL based, or are they
cleaned up when the job is over?

3. We have a very long batch application, and as it runs, the number of total
tasks for each job gets larger and larger. This is not really a problem,
because most of those tasks are skipped since we cache RDDs. We noticed,
however, a delay in the actual start of each job of roughly 1 minute for every
2M tasks in the job. Are there suggested workarounds to avoid that delay?
Maybe saving the RDD and re-loading it, as in the sketch below?
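
For reference, this is roughly the workaround I have in mind, truncating the
lineage either by checkpointing or by writing the RDD out and reading it back
(a minimal sketch; the paths are made up):

import org.apache.spark.{SparkConf, SparkContext}

object TruncateLineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("truncate-lineage").setMaster("local[2]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory

    // Stand-in for the RDD whose lineage keeps growing in our batch job.
    val rdd = sc.parallelize(1 to 1000)
      .map(i => (i % 10, i.toLong))
      .reduceByKey(_ + _)

    // Option A: reliable checkpoint, which cuts the lineage at this point.
    rdd.cache()
    rdd.checkpoint()
    rdd.count() // action that materializes the checkpoint

    // Option B: save and reload, which also starts a fresh lineage.
    val path = "/tmp/rdd-snapshot" // hypothetical path
    rdd.saveAsObjectFile(path)
    val reloaded = sc.objectFile[(Int, Long)](path)
    reloaded.count()

    sc.stop()
  }
}

I am assuming here that a reloaded RDD starts with a fresh, short lineage;
please correct me if that is not the case.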

Thanks
Thomas
