Thanks!

On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Yes, finally the shuffle data will be written to disk for the reduce stage
> to pull, no matter how large you set the shuffle memory fraction.
>
> Thanks
> Saisai
>
> On Thu, Aug 6, 2015 at 7:50 AM, Muler <mulugeta.abe...@gmail.com> wrote:
>
>> Thanks. So if I have large enough memory (with enough
>> spark.shuffle.memoryFraction), then shuffle spill doesn't happen (per
>> node) and the shuffle stays in memory, but the shuffle data still has
>> to ultimately be written to disk so that the reduce stage can pull it
>> across the network?
>>
>> On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao <sai.sai.s...@gmail.com>
>> wrote:
>>
>>> Hi Muler,
>>>
>>> Shuffle data will be written to disk no matter how much memory you
>>> have; large memory can alleviate shuffle spill, where temporary files
>>> are generated when memory is not enough.
>>>
>>> Yes, each node writes its shuffle data to file, and it is pulled from
>>> disk in the reduce stage by the network framework (Netty by default).
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Thu, Aug 6, 2015 at 7:10 AM, Muler <mulugeta.abe...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Consider I'm running WordCount with 100m of data on a 4-node cluster,
>>>> assuming my RAM size on each node is 200g and I'm giving my executors
>>>> 100g (just enough memory for the 100m of data).
>>>>
>>>> 1. If I have enough memory, can Spark 100% avoid writing to disk?
>>>> 2. During shuffle, where results have to be collected from nodes, does
>>>> each node write to disk and then the results are pulled from disk? If
>>>> not, what API is being used to pull data from nodes across the
>>>> cluster? (I'm wondering what Scala or Java packages would allow you to
>>>> read in-memory data from other machines?)
>>>>
>>>> Thanks,
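For concreteness, here is a minimal sketch of the WordCount job discussed above, written against the Spark 1.x Scala API (current as of this thread). The input/output paths and the memoryFraction value are placeholders, not part of the original question; the comments mark where the shuffle write and fetch that Saisai describes actually happen:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCount")
          // Spark 1.x setting discussed above: fraction of executor memory
          // used for shuffle aggregation before spilling. Raising it reduces
          // *spills*, but the final map output is still written to local
          // disk regardless (illustrative value).
          .set("spark.shuffle.memoryFraction", "0.4")
          // Scratch directory where the shuffle files land (assumed path).
          .set("spark.local.dir", "/tmp/spark-scratch")

        val sc = new SparkContext(conf)

        // reduceByKey introduces the shuffle: each map task writes its
        // output to local disk, and reduce tasks then fetch those blocks
        // over the network (Netty by default).
        sc.textFile("hdfs:///input/data.txt")            // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///output/wordcount")    // placeholder path

        sc.stop()
      }
    }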
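The fetch side Saisai mentions is also configurable in Spark 1.x; a sketch of the relevant settings, with names per the 1.x configuration docs and illustrative values:

    // The reduce stage pulls map output over the network framework;
    // Netty is the default transfer service in Spark 1.x:
    conf.set("spark.shuffle.blockTransferService", "netty")
    // How much map output each reduce task fetches concurrently:
    conf.set("spark.reducer.maxSizeInFlight", "48m")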