Hi,

My two cents: it could be interesting if all RDD and pair RDD operations were lifted to work on grouped RDDs. For example, as suggested, a map on a grouped RDD would be more efficient if the original RDD had many duplicate entries, but for RDDs with few repetitions you would probably lose efficiency. The same applies to filter, sortBy, count, max, etc., though I suspect there is no gain for reduce and similar operations. Also note that ordering is lost when converting to a grouped RDD, so the semantics are not exactly the same, but they would be good enough for many applications.

I would also look for use cases where RDDs with many repetitions arise naturally and where the transformations that benefit, like map, are used often, and then run experiments comparing the performance of a computation on a grouped RDD against the same computation without grouping, for different input sizes.
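To illustrate the lifting idea, here is a rough sketch using plain Scala collections in place of RDDs (all names here are hypothetical, and `Map[A, Int]` stands in for the grouped form). A lifted map applies f once per distinct value and then re-merges counts, since f may send distinct inputs to the same output; a lifted filter just drops whole groups:

```scala
// Hypothetical sketch: a grouped "RDD" represented as value -> count,
// using plain Scala collections in place of Spark RDDs.
object GroupedSketch {
  type Grouped[A] = Map[A, Int]

  // Lifted map: apply f once per distinct value, then merge the counts
  // of values that f maps to the same result.
  def liftedMap[A, B](g: Grouped[A])(f: A => B): Grouped[B] =
    g.toSeq
      .map { case (v, n) => (f(v), n) }
      .groupBy(_._1)
      .map { case (b, pairs) => (b, pairs.map(_._2).sum) }

  // Lifted filter: keep only the groups whose value passes the predicate.
  def liftedFilter[A](g: Grouped[A])(p: A => Boolean): Grouped[A] =
    g.filter { case (v, _) => p(v) }

  def main(args: Array[String]): Unit = {
    val grouped: Grouped[String] = Map("Spark" -> 3, "Flink" -> 1)
    // f runs once per distinct value, not once per element;
    // "Spark" and "Flink" both have length 5, so their counts merge.
    println(liftedMap(grouped)(_.length)) // Map(5 -> 4)
    println(liftedFilter(grouped)(_.startsWith("S"))) // Map(Spark -> 3)
  }
}
```

The cost of the lifted map is proportional to the number of distinct values rather than the number of elements, which is where the gain for highly repetitive RDDs would come from.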
On Sunday, July 19, 2015, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> This functionality already basically exists in Spark. To create the
> "grouped RDD", one can run:
>
> val groupedRdd = rdd.reduceByKey(_ + _)
>
> To get it back into the original form:
>
> groupedRdd.flatMap(x => List.fill(x._1)(x._2))
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:
>
>> Hi,
>>
>> I am looking for a suitable issue for a Master's degree project (it sounds like
>> scalability problems and improvements for Spark Streaming), and it seems like
>> the introduction of a grouped RDD (for example: don't store
>> "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:
>>
>> 1. Reduce the memory needed for the RDD (roughly, memory used will be
>> proportional to the % of unique messages)
>> 2. Improve performance (no need to apply a function several times to the
>> same message).
>>
>> Can I create a ticket and introduce an API for grouped RDDs? Does it make sense?
>> I would also appreciate criticism and ideas.
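One caveat on the quoted round trip: with (value, count) pairs, the expansion should fill count copies of the value, i.e. `List.fill(x._2)(x._1)` rather than `List.fill(x._1)(x._2)`, and building the grouped form from an RDD of raw values needs a map to (value, 1) before reducing. A self-contained sketch of the corrected round trip, again with plain Scala collections standing in for RDDs:

```scala
// Sketch of the grouped-RDD round trip on plain collections
// (the collection analogues of the Spark calls are noted in comments).
object RoundTripSketch {
  def main(args: Array[String]): Unit = {
    val words = Seq("Spark", "Spark", "Spark", "Flink")

    // Grouped form: analogue of rdd.map(w => (w, 1)).reduceByKey(_ + _)
    val grouped: Map[String, Int] =
      words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    println(grouped("Spark")) // 3

    // Back to the original multiset (ordering is not preserved):
    // count copies of the value, i.e. x._2 copies of x._1.
    val restored = grouped.toSeq.flatMap { case (w, n) => Seq.fill(n)(w) }
    println(restored.sorted == words.sorted) // true
  }
}
```

As noted earlier in the thread, only the multiset is recovered, not the original element order, so this round trip is semantics-preserving only for order-insensitive computations.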