You need to first partition the data by the key.
Use mapPartitions instead of map.
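
Something along these lines might work (just a rough, untested sketch; I'm
assuming the keyValueRecordsRDD and getCombinations from your mail below, and
the partition count of 4 is only a placeholder):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD operations (partitionBy, reduceByKey)
import org.apache.spark.rdd.RDD

// Shuffle once so that all records with the same key land in the same partition.
val partitioned: RDD[(String, String)] =
  keyValueRecordsRDD.partitionBy(new HashPartitioner(4))

// Count combinations per key inside each partition, without another shuffle.
val perKeyCounts: RDD[((String, String), Int)] =
  partitioned.mapPartitions { iter =>
    val counts = scala.collection.mutable.Map.empty[(String, String), Int]
    for ((key, record) <- iter; combo <- getCombinations(record))
      counts((key, combo)) = counts.getOrElse((key, combo), 0) + 1
    counts.iterator
  }

// Roll the per-key counts up into global counts across all keys,
// keeping perKeyCounts around for the per-key numbers.
val globalCounts: RDD[(String, Int)] =
  perKeyCounts
    .map { case ((_, combo), n) => (combo, n) }
    .reduceByKey(_ + _)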

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Fri, May 2, 2014 at 5:33 AM, Arun Swami <a...@caspida.com> wrote:

> Hi,
>
> I am a newbie to Spark. I looked for documentation or examples to answer
> my question but came up empty-handed.
>
> I don't know whether I am using the right terminology but here goes.
>
> I have a file of records. Initially, I had the following Spark program (I
> am omitting all the surrounding code and focusing only on the Spark-related
> code):
>
> ...
> val recordsRDD = sc.textFile(pathSpec, 2).cache
> val countsRDD: RDD[(String, Int)] = recordsRDD
>   .flatMap(x => getCombinations(x))
>   .map(e => (e, 1))
>   .reduceByKey(_ + _)
> ...
>
> Here getCombinations() is a function I have written that takes a record
> and returns List[String].
>
> This program works as expected.
>
> Now, I want to do the following. I want to partition the records in
> recordsRDD by some key extracted from each record. I do this as follows:
>
> val keyValueRecordsRDD: RDD[(String, String)] =
>   recordsRDD.flatMap(getKeyValueRecord(_))
>
> Here getKeyValueRecord() is a function I have written that takes a record
> and returns a Tuple2 of a key and the original record.
>
> Now I want to do the same operations as before (getCombinations() and
> counting occurrences), BUT on each partition as defined by the key.
> Essentially, I want to apply the operations individually within each
> partition. In a separate step, I want to recover the global counts
> across all partitions while keeping the partition-based counts.
>
> How can I do this in Spark?
>
> Thanks!
>
> arun
> *______________*
> *Arun Swami*
>
>
