Hi Daniel,

Yes, that should work too. However, is it possible to set this up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MapReduce, where we can ensure each partition is assigned a certain amount of data by size, by setting some block-size parameter?
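For illustration, the size-based assignment asked about above works in MapReduce roughly like the arithmetic below: a file is cut into splits of at most a configured maximum size, so the split count follows directly from the file size. This is a plain-Scala sketch of that arithmetic, not a Spark or Hadoop API; `maxSplitSizeBytes` stands in for the hypothetical block-size parameter.

```scala
// Sketch of FileInputFormat-style splitting: each partition receives at
// most maxSplitSizeBytes of input, so the number of splits is the
// ceiling of fileSize / maxSplitSize.
def numSplits(fileSizeBytes: Long, maxSplitSizeBytes: Long): Long =
  (fileSizeBytes + maxSplitSizeBytes - 1) / maxSplitSizeBytes // ceiling division

// e.g. a 300 MB file with a 128 MB max split size yields 3 partitions.
```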
On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann <[email protected]> wrote:

> On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia <[email protected]> wrote:
>
>> No, I don't want separate RDDs, because each of these partitions is being
>> processed the same way (in my case, each partition corresponds to HBase
>> keys belonging to one region server, and I will do HBase lookups). After
>> that I have aggregations too, hence all these partitions should be in the
>> same RDD. The reason to follow the partition structure is to limit
>> concurrent HBase lookups targeting a single region server.
>
> Neither of these is necessarily a barrier to using separate RDDs. You can
> define the function you want to use and then pass it to multiple map
> methods. Then you could union all the RDDs to do your aggregations. For
> example, it might look something like this:
>
> val paths: Seq[String] = ... // the paths to the files you want to load
> def myFunc(t: T) = ... // the function to apply to every record
> val rdds = paths.map { path =>
>   sc.textFile(path).map(myFunc)
> }
> val completeRdd = sc.union(rdds)
>
> Does that make any sense?
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: [email protected] W: www.velos.io
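The pattern Daniel describes can be exercised without a Spark cluster by substituting plain Scala collections for RDDs: apply one shared function across several inputs, then union the results so a single aggregation sees everything. This is a hedged sketch only; `myFunc` and the sample inputs are made up for illustration. With Spark, the inner `map` would be `sc.textFile(path).map(myFunc)` and the `++` would be `sc.union(rdds)`.

```scala
// Hypothetical per-record function (stands in for the HBase lookup).
def myFunc(line: String): Int = line.length

// One Seq per input source, each mapped with the same function --
// mirroring one RDD per path, each with the same map applied.
val perSource: Seq[Seq[Int]] =
  Seq(Seq("aa", "bbb"), Seq("c")).map(_.map(myFunc))

// Union the per-source results so a later aggregation covers all of them.
val unioned: Seq[Int] = perSource.reduce(_ ++ _)
// unioned == Seq(2, 3, 1), and unioned.sum aggregates across all sources.
```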
