Re: RDD Grouping

2014-08-19 Thread TJ Klein
Thanks a lot. Yes, mapPartitions seems a better way of dealing with this problem, since with groupBy() I would need to collect() the data before applying parallelize() again, which is expensive.
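(A minimal sketch of the mapPartitions approach described above, assuming a local SparkContext and 10 partitions so each partition acts as one group; processGroup is a hypothetical placeholder for whatever per-group logic is needed:)

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("RDDGrouping").setMaster("local[4]")
  val sc = new SparkContext(conf)

  // Spread 200 ints across 10 partitions; each partition is one "group"
  val data = sc.parallelize(1 to 200, numSlices = 10)

  // Placeholder per-group computation applied to each partition's iterator
  def processGroup(it: Iterator[Int]): Iterator[Int] = it.map(_ * 2)

  // Process each partition as a whole, with no shuffle and no collect()
  val result = data.mapPartitions(processGroup)
  result.collect()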

Re: RDD Grouping

2014-08-19 Thread Sean Owen
groupBy seems to be exactly what you want.

  val data = sc.parallelize(1 to 200)
  data.groupBy(_ % 10).values.map(...)

This would let you process 10 Iterable[Int] in parallel, each of which is 20 ints in this example. It may not make sense to do this in practice, as you'd be shuffling a lot of data.
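(For completeness, a runnable version of the snippet above, assuming an existing SparkContext sc; the per-group sum is just a stand-in for whatever computation you apply to each group:)

  val data = sc.parallelize(1 to 200)

  // groupBy(_ % 10) yields an RDD[(Int, Iterable[Int])] with keys 0..9,
  // each key holding 20 ints; this step shuffles the data across the cluster
  val groups = data.groupBy(_ % 10)

  // Process the 10 groups in parallel; _.sum is a placeholder computation
  val sums = groups.values.map(_.sum)
  sums.collect()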