Hi Daniel,

Yes, that should work too. However, is it possible to set this up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MapReduce, where we can ensure each partition is assigned a certain amount of data by size, by setting some block-size parameter?
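For illustration, the size-based assignment asked about above works in MapReduce roughly like the arithmetic below: a file is cut into splits of at most a configured maximum size, so the split count follows directly from the file size. This is a plain-Scala sketch of that arithmetic, not a Spark or Hadoop API; `maxSplitSizeBytes` stands in for the hypothetical block-size parameter.

```scala
// Sketch of FileInputFormat-style splitting: each partition receives at
// most maxSplitSizeBytes of input, so the number of splits is the
// ceiling of fileSize / maxSplitSize.
def numSplits(fileSizeBytes: Long, maxSplitSizeBytes: Long): Long =
  (fileSizeBytes + maxSplitSizeBytes - 1) / maxSplitSizeBytes // ceiling division

// e.g. a 300 MB file with a 128 MB max split size yields 3 partitions.
```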
On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann <[email protected]> wrote:

> On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia <[email protected]> wrote:
>
>> No, I don't want separate RDDs, because each of these partitions is being
>> processed the same way (in my case, each partition corresponds to HBase
>> keys belonging to one region server, and I will do HBase lookups). After
>> that I have aggregations too, hence all these partitions should be in the
>> same RDD. The reason to follow the partition structure is to limit
>> concurrent HBase lookups targeting a single region server.
>
> Neither of these is necessarily a barrier to using separate RDDs. You can
> define the function you want to use and then pass it to multiple map
> methods. Then you could union all the RDDs to do your aggregations. For
> example, it might look something like this:
>
> val paths: Seq[String] = ... // the paths to the files you want to load
> def myFunc(t: T) = ... // the function to apply to every record
> val rdds = paths.map { path =>
>   sc.textFile(path).map(myFunc)
> }
> val completeRdd = sc.union(rdds)
>
> Does that make any sense?
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: [email protected] W: www.velos.io
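The pattern Daniel describes can be exercised without a Spark cluster by substituting plain Scala collections for RDDs: apply one shared function across several inputs, then union the results so a single aggregation sees everything. This is a hedged sketch only; `myFunc` and the sample inputs are made up for illustration. With Spark, the inner `map` would be `sc.textFile(path).map(myFunc)` and the `++` would be `sc.union(rdds)`.

```scala
// Hypothetical per-record function (stands in for the HBase lookup).
def myFunc(line: String): Int = line.length

// One Seq per input source, each mapped with the same function --
// mirroring one RDD per path, each with the same map applied.
val perSource: Seq[Seq[Int]] =
  Seq(Seq("aa", "bbb"), Seq("c")).map(_.map(myFunc))

// Union the per-source results so a later aggregation covers all of them.
val unioned: Seq[Int] = perSource.reduce(_ ++ _)
// unioned == Seq(2, 3, 1), and unioned.sum aggregates across all sources.
```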
