I agree; this is better handled by the filesystem cache, not to mention that it enables zero-copy writes.
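For illustration, here is a minimal sketch of what a zero-copy transfer looks like on the JVM, using java.nio's FileChannel.transferTo. The file name, host, and port below are placeholders, not Spark internals:

import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Minimal sketch: serve a shuffle file over a socket without copying the
// bytes through user space. transferTo() lets the kernel move pages from
// the page cache straight to the socket (sendfile), so data still in the
// filesystem cache never touches JVM heap buffers.
// "shuffle_0_0_0.data", the host, and the port are placeholders.
object ZeroCopySketch {
  def main(args: Array[String]): Unit = {
    val socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))
    val in = new FileInputStream("shuffle_0_0_0.data")
    val fileChannel = in.getChannel
    try {
      var position = 0L
      val size = fileChannel.size()
      while (position < size) {
        position += fileChannel.transferTo(position, size - position, socket)
      }
    } finally {
      fileChannel.close()
      in.close()
      socket.close()
    }
  }
}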
Regards,
Mridul

On Sat, May 2, 2015 at 10:26 PM, Reynold Xin <r...@databricks.com> wrote:
> I've personally prototyped completely in-memory shuffle for Spark three
> times. However, it is unclear how big a gain it would be to keep all of
> this in memory under newer file systems (ext4, xfs). If the shuffle data
> is small, it is still sitting in the file system buffer cache anyway. Note
> that network throughput is often lower than disk throughput, so reading
> the data from disk is not the bottleneck. And not having to keep all of
> this in memory substantially simplifies memory management.
>
> On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri <pramodbilig...@gmail.com>
> wrote:
>
>> Hi,
>> I was trying to see if I can make Spark avoid hitting the disk for small
>> jobs, but I see that SortShuffleWriter.write() always writes to disk. I
>> found an older thread (
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
>> ) saying that it doesn't call fsync on this write path.
>>
>> My question is: why does it always write to disk?
>> Does that mean the reduce phase reads the result from disk as well?
>> Isn't it possible to read the data directly from the map-side buffer in
>> ExternalSorter during the reduce phase?
>>
>> Thanks,
>> Pramod
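For readers following the thread, a simplified sketch of the on-disk layout produced by the sort-based shuffle may help: each map task writes one data file of concatenated, sorted partitions plus an index file of byte offsets. This is only an illustration of the idea; the object and method names below are invented and do not match SortShuffleWriter's actual code.

import java.io.{DataOutputStream, File, FileOutputStream}

// Simplified illustration of the sort-shuffle on-disk layout: one data file
// containing the map task's output partitions back to back, plus an index
// file of cumulative byte offsets so reducers can seek straight to their
// partition. All names here are invented for the sketch.
object ShuffleLayoutSketch {
  def writeMapOutput(partitions: Seq[Array[Byte]],
                     dataFile: File,
                     indexFile: File): Unit = {
    val data = new FileOutputStream(dataFile)
    val index = new DataOutputStream(new FileOutputStream(indexFile))
    try {
      var offset = 0L
      index.writeLong(offset)
      for (partition <- partitions) {
        data.write(partition)     // partition bytes, already sorted/serialized
        offset += partition.length
        index.writeLong(offset)   // cumulative offset = start of next partition
      }
    } finally {
      data.close()
      index.close()
    }
    // Note: no fsync here -- the bytes usually stay in the OS buffer cache,
    // which is why small shuffles are often served from memory anyway.
  }
}

A reducer then reads offsets i and i+1 from the index file and fetches exactly that byte range from the data file, which for small jobs is typically still resident in the OS buffer cache.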