Re: all values for a key must fit in memory

2014-05-25 Thread Nilesh
The Iterator implementation works OK for me here, though it might turn out to be slow.

Cheers,
Nilesh

PS: Can't wait for 1.0! ^_^ Looks like it's reached RC10 by now.

Re: all values for a key must fit in memory

2014-05-25 Thread Nilesh
…from 1.0, the logic of which thankfully works with 0.9.1 too; no new API changes there.

Cheers,
Nilesh

Re: all values for a key must fit in memory

2014-05-25 Thread Andrew Ash
> …propose a workaround for this for the meantime? I'm out of ideas.
>
> Thanks,
> Nilesh

Re: all values for a key must fit in memory

2014-04-21 Thread Sandy Ryza
Thanks Matei and Mridul - I was basically wondering whether we would be able to change the shuffle to accommodate this after 1.0, and from your answers it sounds like we can.

Re: all values for a key must fit in memory

2014-04-21 Thread Mridul Muralidharan
As Matei mentioned, the values are now an Iterable, which can be disk-backed. Does that not address the concern?

@Patrick - we do have cases where the length of the sequence is large and the size per value is also non-trivial, so we do need this :-) Note that join is a trivial example where this is…
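
A minimal sketch of the kind of join Mridul is pointing at, assuming a local SparkContext and made-up key counts; join goes through cogroup, which buffers every value for a key before emitting pairs, so one hot key can dominate memory even when each value is tiny:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "skewed-join-sketch")
    // One hot key on each side; the per-key buffer grows with these counts.
    val left  = sc.parallelize((1 to 100000).map(i => ("hot", i)))
    val right = sc.parallelize((1 to 100).map(i => ("hot", i)))
    val joined = left.join(right)   // 10,000,000 output pairs, all under "hot"
    println(joined.count())
    sc.stop()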

Re: all values for a key must fit in memory

2014-04-20 Thread Patrick Wendell
Just wanted to mention - one common thing I've seen users do is use groupByKey, then do something that is commutative and associative once the values are grouped. Really, users here should be doing reduceByKey:

    rdd.groupByKey().map { case (key, values) => (key, values.sum) }
    rdd.reduceByKey(_ + _)

I…
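
A self-contained version of the same comparison, assuming a local master and toy data (names and values are invented); both produce the same sums, but reduceByKey merges partial sums map-side and never buffers all values for a key:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "group-vs-reduce")
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    // Anti-pattern: every value for a key is materialized before summing.
    val viaGroup = rdd.groupByKey().map { case (key, values) => (key, values.sum) }

    // Preferred: partial sums are combined map-side; no per-key buffering.
    val viaReduce = rdd.reduceByKey(_ + _)

    println(viaGroup.collect().toMap)   // Map(a -> 4, b -> 6)
    println(viaReduce.collect().toMap)  // Map(a -> 4, b -> 6)
    sc.stop()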

Re: all values for a key must fit in memory

2014-04-20 Thread Matei Zaharia
We’ve updated the user-facing API of groupBy in 1.0 to allow this: https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API is internal to Spark, so it doesn’t really matter what it is, because we can change it. But the problem before was that groupBy and cogroup were defined as returning…
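
Paraphrased signatures to show what SPARK-1271 changed (the exact declarations in the codebase differ); returning an Iterable leaves room for a disk-backed implementation underneath, whereas the old Seq had to be fully materialized:

    import org.apache.spark.rdd.RDD

    // Spark 0.9.x (roughly): def groupByKey(): RDD[(K, Seq[V])]
    // Spark 1.0   (roughly): def groupByKey(): RDD[(K, Iterable[V])]

    // User code that only streams over the values is unaffected either way:
    def totalPerKey(grouped: RDD[(String, Iterable[Int])]): RDD[(String, Int)] =
      grouped.map { case (key, values) => (key, values.foldLeft(0)(_ + _)) }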

Re: all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
The issue isn't that the Iterator[P] can't be disk-backed. It's that, with a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read into memory at once. The ShuffledRDD is agnostic to what goes inside P.
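
A toy rendering of Sandy's point (the type alias and names are invented for illustration): even if the iterator of records streams from disk, each grouped record bundles all of a key's values in memory at once:

    // Schematically, what the shuffle hands back for a groupBy:
    type P = (String, Seq[Int])            // one record = a key plus ALL its values

    def consume(records: Iterator[P]): Unit =
      records.foreach { case (key, values) =>
        // `records` may stream from disk, but `values` is already a fully
        // materialized collection by the time we see it here.
        println(s"$key -> ${values.size} values")
      }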

Re: all values for a key must fit in memory

2014-04-20 Thread Mridul Muralidharan
An iterator does not imply the data has to be memory-resident - think of merge-sort output as an iterator (disk-backed). Tom is actually planning to work with me on something similar, hopefully this month or next.

Regards,
Mridul
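
A small sketch of that idea, with invented file names: the merged view is itself just an Iterator, so nothing forces the underlying (possibly disk-backed) sorted runs into memory:

    // Lazily merge two already-sorted iterators; each next() pulls one element.
    def mergeSorted(a: Iterator[Int], b: Iterator[Int]): Iterator[Int] = {
      val ba = a.buffered
      val bb = b.buffered
      new Iterator[Int] {
        def hasNext: Boolean = ba.hasNext || bb.hasNext
        def next(): Int =
          if (!bb.hasNext) ba.next()
          else if (!ba.hasNext) bb.next()
          else if (ba.head <= bb.head) ba.next() else bb.next()
      }
    }

    // e.g. merging two sorted spill files without loading either one:
    // val merged = mergeSorted(
    //   scala.io.Source.fromFile("spill-0.txt").getLines().map(_.trim.toInt),
    //   scala.io.Source.fromFile("spill-1.txt").getLines().map(_.trim.toInt))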

all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
Hey all,

After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key to not all fit in memory. The current ShuffleFetcher.fetch API, which doesn't distinguish between keys and values and only returns an Iterator[P], seems incompatible with this. Any thoughts on how we could achieve…
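
For reference, a rough sketch of the Hadoop side Sandy is comparing against (the class name is invented); a new-API reducer receives a lazy Iterable over one key's values, so they never need to be resident all at once:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Reducer

    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text,
                          values: java.lang.Iterable[IntWritable],
                          context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()   // streams values one by one
        context.write(key, new IntWritable(sum))
      }
    }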