Iterator implementation works OK for me here, though it might
turn out to be slow.
Cheers,
Nilesh
PS: Can't wait for 1.0! ^_^ Looks like it's already up to RC10.
…from 1.0, the logic of which thankfully works with 0.9.1 too; no new API
changes there.
Cheers,
Nilesh
…propose a workaround for this in the meantime? I'm out of ideas.
Thanks,
Nilesh
Thanks Matei and Mridul - was basically wondering whether we would be able
to change the shuffle to accommodate this after 1.0, and from your answers
it sounds like we can.
On Mon, Apr 21, 2014 at 12:31 AM, Mridul Muralidharan wrote:
> As Matei mentioned, the values are now an Iterable, which can be disk-backed.
As Matei mentioned, the values are now an Iterable, which can be disk-backed.
Does that not address the concern?
@Patrick - we do have cases where the length of the sequence is large and the
size per value is also non-trivial, so we do need this :-) Note that join is a
trivial example where this is needed.
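For the archive, a sketch of that skew case (the numbers and the local-mode
setup are made up, not from the thread):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)

  val sc = new SparkContext(
    new SparkConf().setMaster("local[2]").setAppName("skew-demo"))
  // Almost every left-side record lands on the single hot key 0.
  val left  = sc.parallelize(1 to 100000).map(i => (if (i % 100 == 0) i else 0, i))
  val right = sc.parallelize(1 to 100).map(i => (0, i))
  // join cogroups both sides first, so for key 0 it buffers ~99,000 left
  // values plus 100 right values before emitting the cross product; that
  // per-key buffer is exactly what has to fit in memory.
  val joined = left.join(right)
  println(joined.count())
  sc.stop()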
Just wanted to mention - one common thing I've seen users do is use
groupByKey, then do something that is commutative and associative once the
values are grouped. Users here should really be doing reduceByKey, i.e.
instead of

  rdd.groupByKey().map { case (key, values) => (key, values.sum) }

prefer

  rdd.reduceByKey(_ + _)
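A runnable version of that contrast, for reference (the local-mode setup and
the sample data are illustrative, not from the thread):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)

  val sc = new SparkContext(
    new SparkConf().setMaster("local[2]").setAppName("sum-demo"))
  val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // Anti-pattern: every value for a key crosses the shuffle and is buffered.
  val viaGroup = rdd.groupByKey().map { case (k, vs) => (k, vs.sum) }

  // Preferred: partial sums are combined map-side, so only one value per
  // key per partition crosses the shuffle.
  val viaReduce = rdd.reduceByKey(_ + _)

  println(viaGroup.collect().toMap)   // Map(a -> 4, b -> 2)
  println(viaReduce.collect().toMap)  // Map(a -> 4, b -> 2)
  sc.stop()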
We've updated the user-facing API of groupBy in 1.0 to allow this:
https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API is
internal to Spark; it doesn't really matter what it is, because we can change
it. But the problem before was that groupBy and cogroup were defined as
returning (key, Seq[values]) pairs.
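In type terms (a hand-written sketch of the change, not copied from Spark's
source):

  import org.apache.spark.rdd.RDD

  // Hand-written sketch, not verbatim from Spark's source.
  trait PairOps09[K, V] {
    def groupByKey(): RDD[(K, Seq[V])]      // 0.9.x: values materialized as a Seq
  }
  trait PairOps10[K, V] {
    def groupByKey(): RDD[(K, Iterable[V])] // 1.0: Iterable leaves room for a
                                            // disk-backed implementation
  }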
The issue isn't that the Iterator[P] can't be disk-backed. It's that, with
a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
into memory at once. The ShuffledRDD is agnostic to what goes inside P.
On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan wrote:
> An iterator does not imply data has to be memory resident.
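Concretely (types simplified by hand, not Spark's actual internals):

  import scala.collection.mutable.ArrayBuffer

  // Simplified by hand, not Spark's actual internals.
  object ShapeOfP {
    type ReducePair = (String, Int)               // reduceByKey: each element is tiny
    type GroupPair  = (String, ArrayBuffer[Int])  // groupByKey: one element can be huge

    // A disk-backed Iterator[P] streams *between* elements, but each single
    // element is deserialized whole; for GroupPair the hot key's entire
    // buffer is one element, so it still lands in memory all at once.
    def drain[P](elems: Iterator[P]): Long = {
      var n = 0L
      elems.foreach(_ => n += 1)  // holds one P at a time, but a P can be huge
      n
    }
  }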
An iterator does not imply the data has to be memory-resident.
Think of merge-sort output as an iterator (disk-backed).
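A plain-Scala illustration (mine, with a made-up file path):

  import scala.io.Source

  // The path is made up. getLines() returns an Iterator[String] that reads
  // the file lazily: one line in memory at a time, however large the file.
  val lines: Iterator[String] = Source.fromFile("/tmp/merge-sorted-run.txt").getLines()
  println(lines.count(_ => true))  // streams the whole file in O(1) memory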
Tom and I are actually planning to work on something along these lines,
hopefully this month or next.
Regards,
Mridul
On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza wrote:
> Hey all,
>
Hey all,
After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key
to not all fit in memory. The current ShuffleFetcher.fetch API, which
doesn't distinguish between keys and values and only returns an Iterator[P],
seems incompatible with this.
Any thoughts on how we could achieve the same in Spark?
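For concreteness, the shape being asked for might look something like this
(a hypothetical signature of my own, mirroring Hadoop's reduce(K, Iterable<V>)
contract; not an actual or proposed Spark API):

  // Hypothetical sketch, not an actual or proposed Spark API.
  trait KeyedShuffleFetcher {
    // Keys arrive one at a time; each key's values come as a lazy,
    // possibly disk-backed stream, like Hadoop's reduce(K, Iterable[V]).
    def fetchGrouped[K, V](shuffleId: Int, reduceId: Int): Iterator[(K, Iterator[V])]
  }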