We’ve updated the user-facing API of groupBy in 1.0 to allow this: 
https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API is 
internal to Spark, so its exact shape doesn’t really matter; we can change 
it. The problem before was that groupBy and cogroup were defined as 
returning (Key, Seq[Value]). Now they return (Key, Iterable[Value]), which 
will let us make the internal changes needed to spill to disk within a 
single key. This will happen after 1.0, but it will be doable without any 
changes to user programs.
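
For example, here is what this looks like from user code in 1.0 (a minimal 
sketch with made-up data, assuming an existing SparkContext named sc):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Pre-1.0:  groupByKey(): RDD[(K, Seq[V])]
// In 1.0:   groupByKey(): RDD[(K, Iterable[V])]
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()

// User code only sees an Iterable, so a future implementation is free to
// stream values from disk rather than materializing them all in a Seq.
grouped.collect().foreach { case (k, vs) => println(k + " -> " + vs.mkString(",")) }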

Matei

On Apr 20, 2014, at 5:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> The issue isn't that the Iterator[P] can't be disk-backed.  It's that, with
> a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
> into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
> 
> On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
> 
>> An iterator does not imply data has to be memory resident.
>> Think merge sort output as an iterator (disk backed).
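>> 
>> Something like this sketch (assuming sorted runs have already been
>> spilled to disk as text files, one integer per line; all names made up):
>> 
>> import scala.io.Source
>> 
>> // Lazily merge pre-sorted spill files; nothing is fully materialized.
>> def mergedIterator(files: Seq[String]): Iterator[Int] = {
>>   val runs = files.map(f => Source.fromFile(f).getLines().map(_.toInt).buffered)
>>   new Iterator[Int] {
>>     def hasNext = runs.exists(_.hasNext)
>>     def next() = runs.filter(_.hasNext).minBy(_.head).next()
>>   }
>> }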
>> 
>> Tom and I are actually planning to work on something similar, hopefully
>> this month or next.
>> 
>> Regards,
>> Mridul
>> 
>> 
>> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>> Hey all,
>>> 
>>> After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
>>> key to not all fit in memory.  The current ShuffleFetcher.fetch API,
>>> which doesn't distinguish between keys and values, only returning an
>>> Iterator[P], seems incompatible with this.
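>>> 
>>> For reference, the shape I mean is roughly the following (simplified
>>> from memory, not the exact trait):
>>> 
>>> trait ShuffleFetcher {
>>>   // One flat stream of elements; keys and values aren't distinguished,
>>>   // so for a groupBy each P is an entire (Key, Seq[Value]) pair.
>>>   def fetch[P](shuffleId: Int, reduceId: Int): Iterator[P]
>>> }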
>>> 
>>> Any thoughts on how we could achieve parity here?
>>> 
>>> -Sandy
>> 
