Re: all values for a key must fit in memory

2014-04-20 Thread Patrick Wendell
Just wanted to mention - one common thing I've seen users do is use groupByKey, then do something that is commutative and associative once the values are grouped. Really, users here should be doing reduceByKey: rdd.groupByKey().map { case (key, values) => (key, values.sum) } versus rdd.reduceByKey(_ + _). I…
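A minimal sketch of the contrast above, assuming an existing SparkContext named sc (the variable names are illustrative, not from the original message):

    // Assumes `sc` is a live SparkContext.
    val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupByKey materializes every value for a key before the sum runs.
    val viaGroup = rdd.groupByKey().map { case (key, values) => (key, values.sum) }

    // reduceByKey combines values map-side, so no full per-key list is built.
    val viaReduce = rdd.reduceByKey(_ + _)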

Re: all values for a key must fit in memory

2014-04-20 Thread Matei Zaharia
We’ve updated the user-facing API of groupBy in 1.0 to allow this: https://issues.apache.org/jira/browse/SPARK-1271. The ShuffleFetcher API is internal to Spark; it doesn’t really matter what it is, because we can change it. But the problem before was that groupBy and cogroup were defined as ret…
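For context, SPARK-1271 changed the declared element type of the grouped values. A simplified before/after sketch (the trait names are invented for illustration; the real methods live on PairRDDFunctions):

    import org.apache.spark.rdd.RDD

    // Before 1.0: Seq[V] implies a fully in-memory collection per key.
    trait GroupByPre10[K, V] { def groupByKey(): RDD[(K, Seq[V])] }

    // From 1.0: Iterable[V] leaves room for a disk-backed implementation.
    trait GroupByFrom10[K, V] { def groupByKey(): RDD[(K, Iterable[V])] }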

Re: all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
The issue isn't that the Iterator[P] can't be disk-backed. It's that, with a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read into memory at once. The ShuffledRDD is agnostic to what goes inside P.
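A hypothetical instantiation of Sandy's point, just to show where the memory pressure sits (the concrete types are illustrative):

    object GroupByShape {
      // For a groupBy, each shuffled element P is one whole (key, all-values) pair.
      type P = (String, Seq[Int])

      // Even if `fetched` streams from disk, producing a single element still
      // materializes one key's entire value sequence in memory.
      def consume(fetched: Iterator[P]): Unit =
        fetched.foreach { case (key, values) => println(s"$key: ${values.length} values") }
    }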

Re: all values for a key must fit in memory

2014-04-20 Thread Mridul Muralidharan
An iterator does not imply the data has to be memory resident. Think of merge sort output as an iterator (disk backed). Tom is actually planning to work with me on something similar, hopefully this month or next. Regards, Mridul
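A toy example of Mridul's point that an iterator need not hold its data in memory, assuming values have been spilled one per line to a file (plain Scala, not Spark code):

    import scala.io.Source

    // Lines are read lazily, so only the current element is resident in memory,
    // much like streaming the output of an on-disk merge sort.
    def diskBackedValues(path: String): Iterator[Int] =
      Source.fromFile(path).getLines().map(_.trim.toInt)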

all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
Hey all, After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key to not all fit in memory. The current ShuffleFetcher.fetch API, which doesn't distinguish between keys and values and only returns an Iterator[P], seems incompatible with this. Any thoughts on how we could achieve…
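A paraphrase of the API shape Sandy describes, reduced to its essentials (the real ShuffleFetcher.fetch takes additional arguments such as a TaskContext and a Serializer):

    trait ShuffleFetcher {
      // One flat stream of shuffled elements with no key/value boundary
      // exposed, so a consumer cannot stream a single key's values.
      def fetch[P](shuffleId: Int, reduceId: Int): Iterator[P]
    }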

Re: does shark0.9.1 work well with hadoop2.2.0 ?

2014-04-20 Thread Gordon Wang
Replacing the jar is not enough. You have to change the protobuf dependency in Shark's build script and recompile the source. Protobuf 2.4.1 and 2.5.0 are not binary compatible.
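Since Shark builds with sbt, the change Gordon describes would look roughly like this (an illustrative line, not Shark's actual build file):

    // Bump protobuf to match Hadoop 2.2.0, then rebuild from source;
    // 2.4.1 and 2.5.0 are not binary compatible, so swapping jars cannot work.
    libraryDependencies += "com.google.protobuf" % "protobuf-java" % "2.5.0"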

Re: does shark0.9.1 work well with hadoop2.2.0 ?

2014-04-20 Thread Ted Yu
bq. I have tried replacing protobuf 2.4.1 in shark with protobuf 2.5.0
Did you replace the jar file, or did you change the following in pom.xml and rebuild? <protobuf.version>2.4.1</protobuf.version> Cheers

does shark0.9.1 work well with hadoop2.2.0 ?

2014-04-20 Thread qingyang li
shark 0.9.1 is using protobuf 2.4.1, but hadoop 2.2.0 is using protobuf 2.5.0; how can we make them work together? I have tried replacing protobuf 2.4.1 in shark with protobuf 2.5.0, and it does not work. I have also tried replacing protobuf 2.5.0 in hadoop with shark's 2.4.1, and that does not work either.