found it :) SPARK-1890
thanks cloud-fan

On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote:
> trying to replicate this in spark itself i can for v2.1.0 but not for
> master. i guess it has been fixed
>
> On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i started printing out when kryo serializes my buffer data structure for
>> my aggregator.
>>
>> i would expect every buffer object to ideally get serialized only once:
>> at the end of the map-side before the shuffle (so after all the values for
>> the given key within the partition have been reduced into it). i realize
>> that in reality due to the order of the elements coming in this can not
>> always be achieved. but what i see instead is that the buffer is getting
>> serialized after every call to reduce a value into it, always. could this
>> be the reason it is so slow?
>>
>> On Thu, Jan 19, 2017 at 4:17 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> we just converted a job from RDD to Dataset. the job does a single
>>> map-red phase using aggregators. we are seeing very bad performance for the
>>> Dataset version, about 10x slower.
>>>
>>> in the Dataset version we use kryo encoders for some of the aggregators.
>>> based on some basic profiling of spark in local mode i believe the bad
>>> performance is due to the kryo encoders. about 70% of time is spent in kryo
>>> related classes.
>>>
>>> since we also use kryo for serialization with the RDD i am surprised how
>>> big the performance difference is.
>>>
>>> has anyone seen the same thing? any suggestions for how to improve this?
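
for anyone reading along, here is a toy model of the two behaviours described in the thread: serializing a buffer once per key after all values have been reduced into it, versus serializing it after every reduce call. this is only a sketch of the cost difference, not spark's actual code path. pickle stands in for kryo, a plain list stands in for the aggregator buffer, and the function names are made up for illustration.

```python
import pickle
from collections import defaultdict


def reduce_serialize_once(pairs):
    """Reduce all values per key in memory, then serialize each buffer once."""
    buffers = defaultdict(list)
    for key, value in pairs:
        buffers[key].append(value)
    # One serialization per distinct key, done after the last reduce.
    serialized = {key: pickle.dumps(buf) for key, buf in buffers.items()}
    return serialized, len(serialized)


def reduce_serialize_every_step(pairs):
    """Keep each buffer in serialized form, so every reduce is a round trip."""
    serialized = {}
    serialize_calls = 0
    for key, value in pairs:
        # Deserialize the stored buffer (or start a new one), reduce, re-serialize.
        buf = pickle.loads(serialized[key]) if key in serialized else []
        buf.append(value)
        serialized[key] = pickle.dumps(buf)
        serialize_calls += 1
    return serialized, serialize_calls


# Hypothetical input: 12 values spread over 3 keys.
pairs = [(i % 3, i) for i in range(12)]
once_out, once_count = reduce_serialize_once(pairs)
step_out, step_count = reduce_serialize_every_step(pairs)
print(once_count, step_count)
```

both variants end up with the same buffers, but the second does one serialization per input value instead of one per key, and it also pays for a deserialization on every value after the first for each key. as the buffers grow, each of those round trips gets more expensive too.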