Just wanted to mention - one common thing I've seen users do is use
groupByKey, then do something that is commutative and associative once the
values are grouped. Users here should really be using reduceByKey instead:
rdd.groupByKey().map { case (key, values) => (key, values.sum) }
rdd.reduceByKey(_ + _)
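(For concreteness, a minimal runnable sketch of the difference; the sample
data and app name here are illustrative, not from the original thread.)

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("group-vs-reduce").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey ships every value across the network, then materializes
    // all values for a key before summing them on the reduce side.
    val viaGroup = rdd.groupByKey().map { case (key, values) => (key, values.sum) }

    // reduceByKey combines values map-side first, shuffling far less data.
    val viaReduce = rdd.reduceByKey(_ + _)

    println(viaGroup.collect().toSeq)   // List((a,4), (b,2)) in some order
    println(viaReduce.collect().toSeq)  // same result, cheaper shuffle
    sc.stop()
  }
}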
We've updated the user-facing API of groupBy in 1.0 to allow this:
https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API is
internal to Spark; it doesn't really matter what it is, because we can
change it. But the problem before was that groupBy and cogroup were defined
as returning a Seq of values, which has to be fully materialized in memory.
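(For readers following along: a rough sketch of the shape of that change,
with simplified signatures rather than the exact Spark source.)

import org.apache.spark.rdd.RDD

// Before 1.0 (simplified): the grouped values were a concrete in-memory Seq.
//   def groupByKey(): RDD[(K, Seq[V])]
// After SPARK-1271 (simplified): only Iterable is promised, so a later
// implementation is free to back the values with something spillable.
//   def groupByKey(): RDD[(K, Iterable[V])]
def sumPerKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
  rdd.groupByKey().mapValues(_.sum)  // callers only iterate over the values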
The issue isn't that the Iterator[P] can't be disk-backed. It's that, with
a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
into memory at once. The ShuffledRDD is agnostic to what goes inside P.
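(A hypothetical illustration of that point; P and the buffer type below are
stand-ins, not the actual Spark internals.)

import scala.collection.mutable.ArrayBuffer

// Even if this iterator streams lazily from disk, each single element
// already bundles every value for its key, so one very large key still
// has to fit in memory all at once.
type P = (String, ArrayBuffer[Int])
def consume(fetched: Iterator[P]): Unit =
  fetched.foreach { case (key, values) =>
    println(s"$key -> ${values.size} values")  // `values` fully in memory here
  }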
On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan wrote:
An iterator does not imply data has to be memory resident.
Think merge sort output as an iterator (disk backed).
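(To make the merge-sort analogy concrete, a minimal self-contained sketch,
not Spark code; the file is assumed to hold a flat stream of ints.)

import java.io.{BufferedInputStream, DataInputStream, EOFException, File, FileInputStream}

// An Iterator[Int] whose elements are streamed from disk one at a time,
// so the full dataset never has to be memory resident.
class DiskBackedIterator(file: File) extends Iterator[Int] {
  private val in =
    new DataInputStream(new BufferedInputStream(new FileInputStream(file)))
  private var buffered: Option[Int] = advance()

  // Read one element ahead; close the stream when the file is exhausted.
  private def advance(): Option[Int] =
    try Some(in.readInt())
    catch { case _: EOFException => in.close(); None }

  override def hasNext: Boolean = buffered.isDefined
  override def next(): Int = {
    val v = buffered.getOrElse(throw new NoSuchElementException("empty iterator"))
    buffered = advance()
    v
  }
}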
Tom and I are actually planning to work on something along these lines,
hopefully this month or next.
Regards,
Mridul
On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza wrote:
Hey all,
After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key
to not all fit in memory. The current ShuffleFetcher.fetch API, which
doesn't distinguish between keys and values, only returning an Iterator[P],
seems incompatible with this.
Any thoughts on how we could achieve the same in Spark?
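(For context, a rough sketch of the shapes under discussion; the signatures
are simplified, and the second trait is a hypothetical alternative, not
anything that exists in Spark.)

// Roughly the current shape: a flat stream with no key/value boundary
// visible to the framework.
trait ShuffleFetcher {
  def fetch[P](shuffleId: Int, reduceId: Int): Iterator[P]
}

// Hypothetical alternative in the spirit of Hadoop's reduce-side API:
// the values for one key arrive as their own lazily-consumable iterator.
trait GroupedShuffleFetcher {
  def fetchGrouped[K, V](shuffleId: Int, reduceId: Int): Iterator[(K, Iterator[V])]
}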
Replacing the jar is not enough. You have to change the protobuf dependency
in Shark's build script and recompile the source.
Protobuf 2.4.1 and 2.5.0 are not binary compatible.
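(For illustration, the change meant here would look roughly like this in an
SBT build definition; the exact file and setting in the Shark tree may
differ, so treat this as a sketch.)

// Bump the protobuf artifact to match Hadoop 2.2.0, then rebuild from source.
libraryDependencies += "com.google.protobuf" % "protobuf-java" % "2.5.0"  // was 2.4.1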
On Sun, Apr 20, 2014 at 6:45 PM, qingyang li wrote:
bq. I have tried replacing protobuf 2.4.1 in Shark with protobuf 2.5.0
Did you replace the jar file, or did you change the following in pom.xml and
rebuild?
<protobuf.version>2.4.1</protobuf.version>
Cheers
On Sun, Apr 20, 2014 at 3:45 AM, qingyang li wrote:
Shark 0.9.1 is using protobuf 2.4.1, but Hadoop 2.2.0 is using protobuf
2.5.0. How can we make them work together?
I have tried replacing protobuf 2.4.1 in Shark with protobuf 2.5.0; it does
not work.
I have also tried replacing protobuf 2.5.0 in Hadoop with Shark's 2.4.1;
that does not work either.