The RDD API has functions to join multiple RDDs, such as PariRDD.join
or PariRDD.cogroup that take another RDD as input. e.g.
firstRDD.join(secondRDD)
I'm looking for ways to do the opposite: split an existing RDD. What is the
right way to create derivate RDDs from an existing RDD?
e.g. imagine I've an collection or pairs as input: colRDD =
(k1->v1)...(kx->vy)...
I could do:
val byKey = colRDD.groupByKey() = (k1->(k1->v1... k1->vn)),...(kn->(kn->vy,
...))
Now, I'd like to create an RDD from the values to have something like:
val groupedRDDs = (k1->RDD(k1->v1,...k1->vn), kn -> RDD(kn->vy, ...))
in this example, there's an f(byKey) = groupedRDDs. What's that f(x) ?
Would: byKey.map{case (k,v) => k->sc.makeRDD(v.toSeq)} the
right/recommended way to do this? Any other options?
Thanks,
Gerard.