I agree; this is better handled by the filesystem cache, not to mention that it enables zero-copy writes.
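For illustration, here is a minimal sketch of what a zero-copy transfer looks like on the JVM, using java.nio's FileChannel.transferTo. The file name, host, and port below are placeholders, not Spark internals:

import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Minimal sketch: serve a shuffle file over a socket without copying the
// bytes through user space. transferTo() lets the kernel move pages from
// the page cache straight to the socket (sendfile), so data still in the
// filesystem cache never touches JVM heap buffers.
// "shuffle_0_0_0.data", the host, and the port are placeholders.
object ZeroCopySketch {
  def main(args: Array[String]): Unit = {
    val socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))
    val in = new FileInputStream("shuffle_0_0_0.data")
    val fileChannel = in.getChannel
    try {
      var position = 0L
      val size = fileChannel.size()
      while (position < size) {
        position += fileChannel.transferTo(position, size - position, socket)
      }
    } finally {
      fileChannel.close()
      in.close()
      socket.close()
    }
  }
}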
Regards,
Mridul

On Sat, May 2, 2015 at 10:26 PM, Reynold Xin <r...@databricks.com> wrote:
> I've personally prototyped completely in-memory shuffle for Spark three
> times. However, it is unclear how big a gain it would be to keep all of
> this in memory under newer file systems (ext4, xfs). If the shuffle data
> is small, it is still sitting in the file system buffer cache anyway. Note
> that network throughput is often lower than disk throughput, so reading
> the data from disk is not the bottleneck. And not having to keep all of
> this in memory substantially simplifies memory management.
>
> On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri <pramodbilig...@gmail.com>
> wrote:
>
>> Hi,
>> I was trying to see if I can make Spark avoid hitting the disk for small
>> jobs, but I see that SortShuffleWriter.write() always writes to disk. I
>> found an older thread (
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
>> ) saying that it doesn't call fsync on this write path.
>>
>> My question is: why does it always write to disk?
>> Does that mean the reduce phase reads the result from disk as well?
>> Isn't it possible to read the data directly from the map-side buffer in
>> ExternalSorter during the reduce phase?
>>
>> Thanks,
>> Pramod
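For readers following the thread, a simplified sketch of the on-disk layout produced by the sort-based shuffle may help: each map task writes one data file of concatenated, sorted partitions plus an index file of byte offsets. This is only an illustration of the idea; the object and method names below are invented and do not match SortShuffleWriter's actual code.

import java.io.{DataOutputStream, File, FileOutputStream}

// Simplified illustration of the sort-shuffle on-disk layout: one data file
// containing the map task's output partitions back to back, plus an index
// file of cumulative byte offsets so reducers can seek straight to their
// partition. All names here are invented for the sketch.
object ShuffleLayoutSketch {
  def writeMapOutput(partitions: Seq[Array[Byte]],
                     dataFile: File,
                     indexFile: File): Unit = {
    val data = new FileOutputStream(dataFile)
    val index = new DataOutputStream(new FileOutputStream(indexFile))
    try {
      var offset = 0L
      index.writeLong(offset)
      for (partition <- partitions) {
        data.write(partition)     // partition bytes, already sorted/serialized
        offset += partition.length
        index.writeLong(offset)   // cumulative offset = start of next partition
      }
    } finally {
      data.close()
      index.close()
    }
    // Note: no fsync here -- the bytes usually stay in the OS buffer cache,
    // which is why small shuffles are often served from memory anyway.
  }
}

A reducer then reads offsets i and i+1 from the index file and fetches exactly that byte range from the data file, which for small jobs is typically still resident in the OS buffer cache.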