On ..., 2016 at 6:28 AM, Sea <261810...@qq.com> wrote:

> Hi, Corey:
> "The dataset is 100gb at most, the spills can go up to 10T-100T." Are your
> input files in lzo format, and do you use sc.textFile()? If memory is not
> enough, Spark will spill 3-4x of the input data to disk.
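A minimal sketch of the situation Sea is describing, assuming compressed text read with sc.textFile(); the path and partition count below are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: compressed text often arrives as very few partitions (a
// non-splittable file gives one partition per file), and the decompressed
// records can be several times the on-disk size.
def loadCompressedText(sc: SparkContext): RDD[String] = {
  val raw = sc.textFile("hdfs:///data/input/*.lzo")  // hypothetical path
  // Spread the decompressed records across many partitions before any wide
  // operation, so per-task state (and therefore spilling) stays bounded.
  raw.repartition(2000)                              // partition count is a guess
}
```

Repartitioning right after the read is a blunt instrument, but it keeps the amount of decompressed data each task holds during the shuffle roughly constant.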
> ------------------ Original Mail ------------------
> From: "Corey Nolet"
> Sent: Sunday, February 7, 2016, 8:56 PM
> To: "Igor Berman"
> Cc: "user"
> Subject: Re: Shuffle memory woes
>
As for the second part of your questions: we have a fairly complex join
process which requires a ton of stage orchestration from our driver. I've
written some code to be able to walk down our DAG tree and execute siblings
in the tree concurrently where possible (forcing cache to disk on children
th…
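The orchestration code itself isn't shown in the thread; the following is only a rough sketch of the idea described, using driver-side Futures to materialize independent sibling branches concurrently and DISK_ONLY persistence for the intermediates. All names are invented for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Spark's scheduler accepts concurrent job submissions from multiple driver
// threads, so materializing each sibling branch inside a Future lets their
// independent shuffles overlap.
def materializeSiblings(siblings: Seq[RDD[_]]): Unit = {
  val jobs = siblings.map { rdd =>
    Future {
      // Force each branch to disk so its output survives without holding on
      // to executor memory that the shuffles need.
      rdd.persist(StorageLevel.DISK_ONLY)
      rdd.count()  // any action works; count() just materializes the RDD
    }
  }
  Await.result(Future.sequence(jobs), Duration.Inf)
}
```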
Igor,

I don't think the question is "why can't it fit stuff in memory". I know
why it can't fit stuff in memory: because it's a large dataset that needs
to have a reduceByKey() run on it. My understanding is that when it doesn't
fit into memory, it needs to spill in order to consolidate intermediary…
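As an aside, one way to see how much each task is actually spilling, beyond the Spark UI, is a listener over the per-task metrics. A minimal sketch, assuming the Spark 1.x/2.x Scala listener API:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs any task that spilled, with the bytes spilled from memory and to disk.
class SpillLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
      println(s"stage ${taskEnd.stageId}: spilled " +
        s"${m.memoryBytesSpilled} bytes from memory, ${m.diskBytesSpilled} bytes to disk")
    }
  }
}

// Usage (e.g. in the driver, before running the job):
//   sc.addSparkListener(new SpillLogger())
```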
So, can you provide code snippets? In particular, it would be interesting to
see what your transformation chain looks like and how many partitions there
are on each side of the shuffle operation.

The question is why it can't fit stuff in memory when you are shuffling -
maybe your partitioner on the "reduce" side is not configured…
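For concreteness, the kind of snippet being asked for might look like the hypothetical chain below, where the interesting details are the partition counts on each side of the shuffle and the partitioner handed to reduceByKey:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

// Hypothetical transformation chain: key per line, then sum per key.
def keySums(sc: SparkContext, path: String): Unit = {
  val mapped = sc.textFile(path, 2000)               // minPartitions on the map side
    .map(line => (line.split('\t')(0), 1L))
  println(s"map-side partitions: ${mapped.partitions.length}")

  // Explicit partitioner on the reduce side; too few reduce partitions means
  // each reducer must aggregate more data than fits in memory and so spills.
  val reduced = mapped.reduceByKey(new HashPartitioner(4096), _ + _)
  println(s"reduce-side partitions: ${reduced.partitions.length}")
  reduced.count()
}
```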
Igor,

Thank you for the response, but unfortunately the problem I'm referring to
goes beyond this. I have set the shuffle memory fraction to 90% and set
the cache memory fraction to 0. Repartitioning the RDD helped a tad on the
map side but didn't do much for the spilling when there was no longer any…
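Roughly, the configuration being described maps onto the pre-1.6 static memory settings sketched below; the app name, input path, and partition count are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleHeavyJob {
  def main(args: Array[String]): Unit = {
    // Pre-1.6 static memory model: give nearly everything to the shuffle,
    // nothing to the block cache, as described above.
    val conf = new SparkConf()
      .setAppName("shuffle-memory-woes")            // hypothetical app name
      .set("spark.shuffle.memoryFraction", "0.9")   // default is 0.2
      .set("spark.storage.memoryFraction", "0.0")   // default is 0.6
    val sc = new SparkContext(conf)

    // Repartitioning before the wide operation shrinks each map task's slice,
    // which is what "helped a tad on the map side".
    val input = sc.textFile("hdfs:///path/to/input")  // hypothetical path
    println(input.repartition(4096).count())          // partition count is a guess
    sc.stop()
  }
}
```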
Hi,

Usually you can solve this in two steps:
1. give the RDD more partitions
2. play with the shuffle memory fraction

In Spark 1.6 the cache vs. shuffle memory fractions are adjusted automatically.
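For reference, the Spark 1.6 behaviour mentioned here comes from the unified memory manager; a sketch of the relevant settings, with values at their 1.6 defaults:

```scala
import org.apache.spark.SparkConf

// Spark 1.6+ unified memory: storage (cache) and execution (shuffle) share a
// single pool and can borrow from each other, which is why the two old
// fractions no longer need hand-tuning.
def unifiedMemoryConf(): SparkConf = new SparkConf()
  .set("spark.memory.fraction", "0.75")        // share of heap for the unified pool (1.6 default)
  .set("spark.memory.storageFraction", "0.5")  // part of that pool protected for cached blocks (default)
// To get the old pre-1.6 behaviour back (and make the legacy fractions apply):
//   .set("spark.memory.useLegacyMode", "true")
```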
On 5 February 2016 at 23:07, Corey Nolet wrote:
> I just recently had a discovery that my jobs were taking several…