Hi Daniel,

Yes, that should work too. However, is it possible to set it up so that each
RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)? Is there a mechanism similar to MapReduce, where we can ensure
each partition is assigned a certain amount of data by size, by setting some
block size parameter?
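(For reference, the split-size behavior I have in mind is the one Hadoop's
FileInputFormat uses, where each input split — and hence each map task —
gets roughly max(minSize, min(maxSize, blockSize)) bytes. A minimal sketch
of that rule, assuming the standard formula; the parameter names in the
comments are the usual Hadoop config keys:)

```scala
// Sketch of Hadoop FileInputFormat's split-size rule. Each input split
// (and hence each partition read from the file) gets roughly this many bytes.
//   minSize   ~ mapreduce.input.fileinputformat.split.minsize
//   maxSize   ~ mapreduce.input.fileinputformat.split.maxsize
//   blockSize = the HDFS block size of the file
def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
  math.max(minSize, math.min(maxSize, blockSize))

// Raising minSize above the block size yields fewer, larger splits;
// lowering maxSize below it yields more, smaller splits.
```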



On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann <[email protected]>
wrote:

> On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia <
> [email protected]> wrote
>
>>
>> No, I don't want separate RDDs, because each of these partitions is being
>> processed the same way (in my case, each partition corresponds to HBase
>> keys belonging to one region server, and I will do HBase lookups). After
>> that I have aggregations too, hence all these partitions should be in the
>> same RDD. The reason to follow the partition structure is to limit
>> concurrent HBase lookups targeting a single region server.
>>
>
> Neither of these is necessarily a barrier to using separate RDDs. You can
> define the function you want to use and then pass it to multiple map
> methods. Then you could union all the RDDs to do your aggregations. For
> example, it might look something like this:
>
> val paths: Seq[String] = ... // the paths to the files you want to load
> def myFunc(line: String) = ... // the function to apply to each element
> val rdds = paths.map { path =>
>     sc.textFile(path).map(myFunc)
> }
> val completeRdd = sc.union(rdds)
>
> Does that make any sense?
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: [email protected] W: www.velos.io
>