Re: Why is shuffle data always persisted to disk?

Kay Ousterhout Wed, 29 Mar 2017 14:01:42 -0700

There's been some discussion of this on this JIRA
<https://issues.apache.org/jira/browse/SPARK-3376> and the associated PR.
The short summary is that, in theory / to a VERY rough approximation, the
OS buffer cache does everything we'd want an in-memory shuffle to do, and
is simple.


On Wed, Mar 29, 2017 at 3:19 AM, Effi Ofer <effi.o...@gmail.com> wrote:

> Greetings, I was wondering why Spark's Shuffler always persists the
> shuffle data to disk?  I understand that the persisted data can be used by
> the scheduler to truncate the lineage of the RDD graph if an existing RDD
> has been materialized as a side effect of an earlier shuffle.  But that
> does not explain why Spark is not keeping the shuffle RDD in memory until
> memory becomes sufficiently low to trigger victim selection and spilling.
> Any hints and pointers would be appreciated.
>
> Thanks,
> Effi
>
>

Re: Why is shuffle data always persisted to disk?

Reply via email to