Greetings, I was wondering why Spark's shuffle manager always persists shuffle
data to disk.  I understand that the persisted data can be used by the
scheduler to truncate the lineage of the RDD graph when an RDD has already
been materialized as a side effect of an earlier shuffle.  But that does
not explain why Spark does not keep the shuffled data in memory until memory
becomes sufficiently low to trigger victim selection and spilling.  Any
hints and pointers would be appreciated.
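
For reference, here is a minimal sketch of the lineage-truncation behaviour I
am referring to (the object and app names are just placeholders; the part I
take for granted is that a second job over the same shuffled RDD reuses the
shuffle files and shows the map stage as "skipped" in the UI):

    import org.apache.spark.sql.SparkSession

    object ShuffleReuseSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-reuse-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // reduceByKey is a wide transformation, so it introduces a shuffle.
        val pairs  = sc.parallelize(1 to 1000000).map(i => (i % 100, 1L))
        val counts = pairs.reduceByKey(_ + _)   // shuffle boundary

        counts.count()   // first job: runs the map stage and writes shuffle files
        counts.count()   // second job: the map stage is skipped, since the
                         // scheduler reuses the shuffle output already on disk

        spark.stop()
      }
    }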

Thanks,
Effi
