Hi Gerard,

Usually when I want to split one RDD into several, I'm better off rethinking the algorithm so that all the computation happens at once. Here's an example.

Suppose you had a dataset of tuples (URL, webserver, pageSizeBytes), and you wanted to find the average page size that each webserver (e.g. Apache, nginx, IIS) served. Rather than splitting your allPagesRDD into one RDD per webserver (nginxRDD, apacheRDD, IISRDD), it's probably better to compute the average over all of them at once, like this:

    // allPagesRDD is an RDD of (URL, webserver, pageSizeBytes)
    allPagesRDD
      // key each record by webserver, carrying (bytes, count) as the value
      .map { case (url, webserver, pageSizeBytes) => (webserver, (pageSizeBytes, 1)) }
      // sum the bytes and the counts per webserver
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      // divide as Double so the average is not truncated
      .mapValues { case (totalBytes, count) => totalBytes.toDouble / count }

For this example you could use something like Summingbird to avoid tracking the sum and count yourself.

Can you go into more detail about why you want to split one RDD into several?

On Mon, Jun 2, 2014 at 1:13 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> The RDD API has functions to join multiple RDDs, such as PairRDD.join
> or PairRDD.cogroup, that take another RDD as input, e.g.
> firstRDD.join(secondRDD)
>
> I'm looking for ways to do the opposite: split an existing RDD. What is
> the right way to create derived RDDs from an existing RDD?
>
> e.g. imagine I have a collection of pairs as input: colRDD =
> (k1->v1)...(kx->vy)...
> I could do:
> val byKey = colRDD.groupByKey() = (k1->(k1->v1...
> k1->vn)),...(kn->(kn->vy, ...))
>
> Now, I'd like to create an RDD from the values to have something like:
>
> val groupedRDDs = (k1->RDD(k1->v1,...k1->vn), kn -> RDD(kn->vy, ...))
>
> In this example, there's an f(byKey) = groupedRDDs. What's that f(x)?
>
> Would byKey.map{case (k,v) => k->sc.makeRDD(v.toSeq)} be the
> right/recommended way to do this? Any other options?
>
> Thanks,
>
> Gerard.