Just wanted to mention - one common thing I've seen users do is use
groupByKey, then do something that is commutitive and associative once the
values are grouped. Really users here should be doing reduceByKey.

rdd.groupByKey().map{ case (key, values) => (key, values.sum))
rdd.reduceByKey(_ + _)

I've seen this happen particularly for users coming from MapReduce where
they are used to having to write their own combiners and it's not intuitive
that these functions are very different.

Sandy - have you heard from users who have a specific problems they can't
solve using an associative function? I'm sure they exist, but I wonder how
often it's this vs. they just don't understand they API.

I wonder if we should actually warn about this in the groupByKey
documentation.

- Patrick


On Sun, Apr 20, 2014 at 8:13 PM, Matei Zaharia <matei.zaha...@gmail.com>wrote:

> We've updated the user-facing API of groupBy in 1.0 to allow this:
> https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API
> is internal to Spark, it doesn't really matter what it is because we can
> change it. But the problem before was that groupBy and cogroup were defined
> as returning (Key, Seq[Value]). Now they return (Key, Iterable[Value]),
> which will allow us to make the internal changes to allow spilling to disk
> within a key. This will happen after 1.0 though, but it will be doable
> without any changes to user programs.
>
> Matei
>
> On Apr 20, 2014, at 5:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> > The issue isn't that the Iterator[P] can't be disk-backed.  It's that,
> with
> > a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
> > into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
> >
> > On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com
> >wrote:
> >
> >> An iterator does not imply data has to be memory resident.
> >> Think merge sort output as an iterator (disk backed).
> >>
> >> Tom is actually planning to work on something similar with me on this
> >> hopefully this or next month.
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com>
> >> wrote:
> >>> Hey all,
> >>>
> >>> After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
> >> key
> >>> to not all fit in memory.  The current ShuffleFetcher.fetch API, which
> >>> doesn't distinguish between keys and values, only returning an
> >> Iterator[P],
> >>> seems incompatible with this.
> >>>
> >>> Any thoughts on how we could achieve parity here?
> >>>
> >>> -Sandy
> >>
>
>

Reply via email to