Sean, thanks for your message.
On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen <so...@cloudera.com> wrote:
> On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> > OK, it seems like even on a local machine (with no network overhead), the
> > groupByKey version is about 5 times slower than any of the other
> > (reduceByKey, combineByKey etc.) functions...
>
> Even without network overhead, you're still paying the cost of setting
> up the shuffle and serialization.

Can I pick an appropriate partitioner some time beforehand so that Spark "knows" all items with the same key are on the same host? (Or enforce this?)

Thanks
Tobias
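(For the archives: one way to do what I'm asking about is to pre-partition the pair RDD by key and cache it; key-based operations that use the same partitioner can then avoid a shuffle. A minimal sketch, with a hypothetical local setup and made-up data, just to illustrate:)

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

// Hypothetical local setup, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("co-partitioning").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Partition once by key and cache the result; records with the same key
// now live in the same partition (and thus on the same host).
val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()

// groupByKey reuses the existing partitioner here, so no further
// shuffle is needed (serialization/shuffle setup cost is still paid
// only for the initial partitionBy).
val grouped = partitioned.groupByKey()
```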