Sean, thanks for your message.
On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen <so...@cloudera.com> wrote:
> On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> > OK, it seems like even on a local machine (with no network overhead), the
> > groupByKey version is about 5 times slower than any of the other
> > (reduceByKey, combineByKey etc.) functions...
>
> Even without network overhead, you're still paying the cost of setting
> up the shuffle and serialization.

Can I pick an appropriate partitioner some time beforehand so that Spark "knows" all items with the same key are on the same host? (Or enforce this?)

Thanks
Tobias
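(For the archives: one way to do what I'm asking about is to pre-partition the pair RDD by key and cache it; key-based operations that use the same partitioner can then avoid a shuffle. A minimal sketch, with a hypothetical local setup and made-up data, just to illustrate:)

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

// Hypothetical local setup, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("co-partitioning").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Partition once by key and cache the result; records with the same key
// now live in the same partition (and thus on the same host).
val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()

// groupByKey reuses the existing partitioner here, so no further
// shuffle is needed (serialization/shuffle setup cost is still paid
// only for the initial partitionBy).
val grouped = partitioned.groupByKey()
```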