Thanks a lot. Yes, mapPartitions seems a better way of dealing with this
problem, since with groupBy() I would need to collect() the data before
applying parallelize(), which is expensive.
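For reference, a minimal sketch of the mapPartitions approach being discussed, assuming the goal is to process each group of numbers on its own partition without a collect()/parallelize() round trip; the processGroup function and the partition count of 10 are illustrative assumptions, not something from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  // Hypothetical per-group computation; replace with the real logic.
  def processGroup(nums: Iterator[Int]): Iterator[Int] = nums.map(_ * 2)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-partitions-sketch"))
    // Split the data into 10 partitions so each partition plays the role of
    // one group, then process each partition's iterator in place.
    val data = sc.parallelize(1 to 200, 10)
    val result = data.mapPartitions(processGroup)
    result.collect().foreach(println)
    sc.stop()
  }
}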
groupBy seems to be exactly what you want.
val data = sc.parallelize(1 to 200)
data.groupBy(_ % 10).values.map(...)
This would let you process 10 Iterable[Int]s in parallel, each of which
contains 20 ints in this example.
It may not make sense to do this in practice, as you'd be shuffling a
lot of data.
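For completeness, a runnable sketch of the groupBy example above; the per-group sum stands in for the elided map(...) body and is only an illustrative assumption:

import org.apache.spark.{SparkConf, SparkContext}

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupby-sketch"))
    val data = sc.parallelize(1 to 200)
    // Group into 10 buckets by key and process each Iterable[Int] as a whole;
    // summing each group is a placeholder for the real per-group logic.
    val perGroup = data.groupBy(_ % 10).values.map(group => group.sum)
    perGroup.collect().foreach(println)
    sc.stop()
  }
}

Note that this shuffles every element to build the groups, which is the cost mentioned above, whereas the mapPartitions sketch processes data where it already sits.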