sorry i meant to say SPARK-18980

On Sat, Jan 21, 2017 at 1:48 AM, Koert Kuipers <ko...@tresata.com> wrote:

> found it :) SPARK-1890
> thanks cloud-fan
>
> On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> trying to replicate this in spark itself i can for v2.1.0 but not for
>> master. i guess it has been fixed
>>
>> On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i started printing out when kryo serializes my buffer data structure for
>>> my aggregator.
>>>
>>> i would expect every buffer object to ideally get serialized only once:
>>> at the end of the map-side before the shuffle (so after all the values for
>>> the given key within the partition have been reduced into it). i realize
>>> that in reality due to the order of the elements coming in this cannot
>>> always be achieved. but what i see instead is that the buffer is getting
>>> serialized after every call to reduce a value into it, always. could this
>>> be the reason it is so slow?
>>>
>>> On Thu, Jan 19, 2017 at 4:17 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> we just converted a job from RDD to Dataset. the job does a single
>>>> map-reduce phase using aggregators. we are seeing very bad performance for the
>>>> Dataset version, about 10x slower.
>>>>
>>>> in the Dataset version we use kryo encoders for some of the
>>>> aggregators. based on some basic profiling of spark in local mode i believe
>>>> the bad performance is due to the kryo encoders. about 70% of time is spent
>>>> in kryo-related classes.
>>>>
>>>> since we also use kryo for serialization with the RDD i am surprised
>>>> how big the performance difference is.
>>>>
>>>> has anyone seen the same thing? any suggestions for how to improve this?
>>>>
>>>>
>>>
>>
>
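
The difference described in the thread, a buffer serialized after every reduce versus once per key at the end of the map side, can be illustrated with a toy sketch. This is not Spark or kryo code: it uses Python's `pickle` as a stand-in serializer, and `CountingBuffer` and both `aggregate_*` functions are hypothetical names invented for the illustration. It only shows why the per-reduce pattern does N serializations for N values, which is one plausible reason for the slowdown reported above.

```python
import pickle


class CountingBuffer:
    """Toy aggregation buffer that counts how often it gets serialized."""

    serializations = 0  # hypothetical instrumentation, shared across instances

    def __init__(self, total=0):
        self.total = total

    def __reduce__(self):
        # pickle calls this once per dumps(), so it acts as a serialization hook.
        CountingBuffer.serializations += 1
        return (CountingBuffer, (self.total,))


def aggregate_roundtrip_every_reduce(values):
    """Serialize and deserialize the buffer after each value is reduced into it."""
    buf = CountingBuffer()
    for v in values:
        buf.total += v
        buf = pickle.loads(pickle.dumps(buf))
    return buf


def aggregate_serialize_once(values):
    """Reduce all values into the in-memory buffer, then serialize a single time."""
    buf = CountingBuffer()
    for v in values:
        buf.total += v
    return pickle.loads(pickle.dumps(buf))
```

Both functions return the same total; only the number of serializer calls differs (one per value versus one per key), so the cost gap grows with the number of values per key and with the size of the buffer.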
