This functionality already essentially exists in Spark. To create the "grouped RDD" from an RDD of values, pair each element with a count and reduce by key:

val groupedRdd = rdd.map(x => (x, 1)).reduceByKey(_ + _)

To get it back into the original form:

groupedRdd.flatMap { case (value, count) => List.fill(count)(value) }

-Sandy

On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:

> Hi,
>
> I am looking for a suitable issue for a Master's degree project (it sounds
> like scalability problems and improvements for Spark Streaming), and it
> seems like introducing a grouped RDD (for example: don't store
> "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:
>
> 1. Reduce the memory needed for the RDD (roughly, memory used will be
> proportional to the number of unique messages).
> 2. Improve performance (no need to apply a function several times to the
> same message).
>
> Can I create a ticket and introduce an API for grouped RDDs? Does it make
> sense? I would also very much appreciate criticism and ideas.
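For anyone who wants to try the round trip without a Spark cluster, here is a sketch of the same transformation on plain Scala collections (the object name and sample data are just illustrative; groupBy/map stand in for reduceByKey on an RDD):

```scala
// Collapse duplicate values into (value, count) pairs, then expand back.
object GroupedRoundTrip extends App {
  val words = List("Spark", "Spark", "Spark", "Flink")

  // Analogue of rdd.map(x => (x, 1)).reduceByKey(_ + _):
  // group equal values and keep the size of each group as the count.
  val grouped: Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Analogue of groupedRdd.flatMap { case (v, n) => List.fill(n)(v) }:
  // repeat each value count times to recover the original multiset.
  val restored: List[String] =
    grouped.toList.flatMap { case (value, count) => List.fill(count)(value) }

  assert(grouped("Spark") == 3)
  assert(restored.sorted == words.sorted)
  println(grouped)
}
```

The same trade-off shows up here in miniature: the grouped form uses space proportional to the number of unique values, at the cost of a shuffle (in Spark) or a grouping pass (here) to build it.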