Thanks!

On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Yes, finally the shuffle data will be written to disk for the reduce stage
> to pull, no matter how large you set the shuffle memory fraction.
>
> Thanks
> Saisai
>
> On Thu, Aug 6, 2015 at 7:50 AM, Muler <mulugeta.abe...@gmail.com> wrote:
>
>> Thanks. So if I have large enough memory (with enough
>> spark.shuffle.memoryFraction), then shuffle spill doesn't happen (per
>> node) and the shuffle stays in memory, but the shuffle data still has
>> to ultimately be written to disk so that the reduce stage can pull it
>> across the network?
>>
>> On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao <sai.sai.s...@gmail.com>
>> wrote:
>>
>>> Hi Muler,
>>>
>>> Shuffle data will be written to disk no matter how much memory you
>>> have; large memory can alleviate shuffle spill, where temporary files
>>> are generated when memory is not enough.
>>>
>>> Yes, each node writes its shuffle data to file, and it is pulled from
>>> disk in the reduce stage by the network framework (Netty by default).
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Thu, Aug 6, 2015 at 7:10 AM, Muler <mulugeta.abe...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Consider I'm running WordCount with 100m of data on a 4-node cluster,
>>>> assuming my RAM size on each node is 200g and I'm giving my executors
>>>> 100g (just enough memory for the 100m of data).
>>>>
>>>> 1. If I have enough memory, can Spark 100% avoid writing to disk?
>>>> 2. During shuffle, where results have to be collected from nodes, does
>>>> each node write to disk and then the results are pulled from disk? If
>>>> not, what API is being used to pull data from nodes across the
>>>> cluster? (I'm wondering what Scala or Java packages would allow you to
>>>> read in-memory data from other machines?)
>>>>
>>>> Thanks,
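For concreteness, here is a minimal sketch of the WordCount job discussed above, written against the Spark 1.x Scala API (current as of this thread). The input/output paths and the memoryFraction value are placeholders, not part of the original question; the comments mark where the shuffle write and fetch that Saisai describes actually happen:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCount")
          // Spark 1.x setting discussed above: fraction of executor memory
          // used for shuffle aggregation before spilling. Raising it reduces
          // *spills*, but the final map output is still written to local
          // disk regardless (illustrative value).
          .set("spark.shuffle.memoryFraction", "0.4")
          // Scratch directory where the shuffle files land (assumed path).
          .set("spark.local.dir", "/tmp/spark-scratch")

        val sc = new SparkContext(conf)

        // reduceByKey introduces the shuffle: each map task writes its
        // output to local disk, and reduce tasks then fetch those blocks
        // over the network (Netty by default).
        sc.textFile("hdfs:///input/data.txt")            // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///output/wordcount")    // placeholder path

        sc.stop()
      }
    }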
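The fetch side Saisai mentions is also configurable in Spark 1.x; a sketch of the relevant settings, with names per the 1.x configuration docs and illustrative values:

    // The reduce stage pulls map output over the network framework;
    // Netty is the default transfer service in Spark 1.x:
    conf.set("spark.shuffle.blockTransferService", "netty")
    // How much map output each reduce task fetches concurrently:
    conf.set("spark.reducer.maxSizeInFlight", "48m")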