There's been some discussion of this on this JIRA <https://issues.apache.org/jira/browse/SPARK-3376> and the associated PR. The short summary is that, in theory / to a VERY rough approximation, the OS buffer cache does everything we'd want an in-memory shuffle to do, and is simple.
On Wed, Mar 29, 2017 at 3:19 AM, Effi Ofer <effi.o...@gmail.com> wrote: > Greetings, I was wondering why Spark's Shuffler always persists the > shuffle data to disk? I understand that the persisted data can be used by > the scheduler to truncate the lineage of the RDD graph if an existing RDD > has been materialized as a side effect of an earlier shuffle. But that > does not explain why Spark is not keeping the shuffle RDD in memory until > memory becomes sufficiently low to trigger victim selection and spilling. > Any hints and pointers would be appreciated. > > Thanks, > Effi > >