Hi Juan,

That's exactly what I meant. Under high load with many repeated messages, it can significantly reduce RDD size and improve performance. In real use cases, applications frequently need to enrich data from a cache or an external system, so we would save time on each repetition. I will also run some experiments.

About RDDs with few repetitions: in which use cases would we lose efficiency? I will test that as well.

What do I need to do to contribute this? Just create a ticket in Jira?
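
A minimal sketch of the grouping idea discussed in this thread. Plain Scala collections stand in for an RDD so that it runs without a cluster; GroupedSketch, group, mapGrouped and expand are hypothetical names for illustration, not existing Spark API.

```scala
// Sketch only: plain Scala collections stand in for an RDD here.
// GroupedSketch, group, mapGrouped and expand are hypothetical names, not Spark API.
object GroupedSketch {
  // Collapse repeated values into (value, count) pairs.
  // On an RDD the analogous step would be rdd.map((_, 1)).reduceByKey(_ + _).
  def group[A](xs: Seq[A]): Map[A, Int] =
    xs.groupBy(identity).map { case (value, copies) => (value, copies.size) }

  // Apply f once per distinct value. If f maps two different inputs
  // to the same output, their counts are added together.
  def mapGrouped[A, B](grouped: Map[A, Int])(f: A => B): Map[B, Int] =
    grouped.toSeq
      .map { case (value, count) => (f(value), count) }
      .groupBy(_._1)
      .map { case (value, pairs) => (value, pairs.map(_._2).sum) }

  // Expand back to one element per occurrence.
  // The original element order is not preserved.
  def expand[A](grouped: Map[A, Int]): Seq[A] =
    grouped.toSeq.flatMap { case (value, count) => Seq.fill(count)(value) }
}
```

If an enrichment lookup is passed as f, it runs once per distinct message no matter how often that message repeats. When almost every message is unique, the grouping step is extra work with nothing saved, which is the case the experiments should measure.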

2015-07-19 21:56 GMT+03:00 Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>:

> Hi,
>
> My two cents is that this could be interesting if all RDD and pair RDD
> operations were lifted to work on grouped RDDs. For example, as suggested,
> a map on a grouped RDD would be more efficient if the original RDD had
> lots of duplicate entries, but for RDDs with few repetitions I guess you
> would in fact lose efficiency. The same applies to filter, sortBy, count,
> max, ... but I guess there is no gain for reduce and other operations.
> Also note that the order is lost when passing to a grouped RDD, so the
> semantics are not exactly the same, but they would be good enough for
> many applications. I would also look for suitable use cases where RDDs
> with many repetitions arise naturally and where the transformations with
> a performance gain, like map, are used often. And I would do some
> experiments to compare performance between a computation with a grouped
> RDD and the same computation without grouping, for different input sizes.
>
> On Sunday, July 19, 2015, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> This functionality already basically exists in Spark. To create the
>> "grouped RDD", one can run:
>>
>>   // pair each element with a count of 1, then sum the counts per value
>>   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>
>> To get it back into the original form:
>>
>>   // emit each value as many times as its count
>>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>
>> -Sandy
>>
>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am looking for a suitable issue for a Master's degree project
>>> (something like scalability problems and improvements for Spark
>>> Streaming), and it seems that introducing a grouped RDD (for example:
>>> don't store "Spark", "Spark", "Spark", instead store ("Spark", 3)) can:
>>>
>>> 1. Reduce the memory needed for an RDD (roughly, used memory will be:
>>>    % of unique messages)
>>> 2. Improve performance (no need to apply a function several times for
>>>    the same message).
>>>
>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>> make sense? I would also be very grateful for criticism and ideas.