Edit: the first line should read:

  val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
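
For anyone who wants to try the round trip end to end, here is a minimal
sketch. The object name, the helper names `group` and `ungroup`, the local
master setting, and the sample data are all illustrative assumptions, not
an existing or proposed Spark API. Note that List.fill takes the count
first and the element second, and that the expanded RDD does not preserve
the original element order.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object GroupedRddSketch {
  // Collapse duplicate values into (value, count) pairs.
  def group(rdd: RDD[String]): RDD[(String, Int)] =
    rdd.map((_, 1)).reduceByKey(_ + _)

  // Expand (value, count) pairs back into `count` copies of each value.
  def ungroup(grouped: RDD[(String, Int)]): RDD[String] =
    grouped.flatMap { case (value, count) => List.fill(count)(value) }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("grouped-rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Kafka"))
      val groupedRdd = group(rdd)
      val restored = ungroup(groupedRdd)

      // Compare as sorted sequences because element order is not preserved.
      assert(rdd.collect().sorted.toSeq == restored.collect().sorted.toSeq)
    } finally {
      sc.stop()
    }
  }
}
```

The memory saving depends on how many duplicates the data contains: with
mostly unique values, the (value, count) form is slightly larger than the
original.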

On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
wrote:

> This functionality already basically exists in Spark.  To create the
> "grouped RDD", one can run:
>
>   val groupedRdd = rdd.reduceByKey(_ + _)
>
> To get it back into the original form:
>
>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am looking for a suitable issue for a Master's degree project (the topic
>> is roughly scalability problems and improvements for Spark Streaming), and
>> it seems that introducing a grouped RDD (for example: don't store
>> "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:
>>
>> 1. Reduce the memory needed for an RDD (roughly, the memory used will scale
>> with the percentage of unique messages).
>> 2. Improve performance (no need to apply a function several times to the
>> same message).
>>
>> Can I create a ticket and introduce an API for grouped RDDs? Does it make
>> sense? I would also greatly appreciate criticism and ideas.
>>
>
>
