The user gets to choose what resides in memory. If they call cache() on the original RDD, it will be kept in memory; if they call cache() on the compact RDD, that one will be kept in memory; and if cache() is called on both, both will be in memory.
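To make that concrete, here is a minimal sketch (assuming an existing SparkContext `sc`; the sample data and variable names are illustrative):

val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Streaming"))
val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

rdd.cache()        // the original RDD is kept in memory after its first use
groupedRdd.cache() // the compact (element, count) RDD is cached independently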
-Sandy

On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:

> Thanks for the answer! Could you please answer one more question? Will we
> have the original RDD and the grouped RDD in memory at the same time?
>
> 2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>
>> Edit: the first line should read:
>>
>> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>
>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>
>>> This functionality already basically exists in Spark. To create the
>>> "grouped RDD", one can run:
>>>
>>> val groupedRdd = rdd.reduceByKey(_ + _)
>>>
>>> To get it back into the original form:
>>>
>>> groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>
>>> -Sandy
>>>
>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for a suitable topic for a Master's degree project
>>>> (something like scalability problems and improvements for Spark
>>>> Streaming), and it seems like introducing a grouped RDD (for example:
>>>> don't store "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:
>>>>
>>>> 1. Reduce the memory needed for the RDD (roughly, memory used will be
>>>> proportional to the share of unique messages).
>>>> 2. Improve performance (no need to apply a function several times to
>>>> the same message).
>>>>
>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>> make sense? I would also very much appreciate criticism and ideas.
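For completeness, here is a self-contained sketch of the round trip described in the thread above (assuming spark-core on the classpath and a local master; the object name and sample data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object GroupedRddRoundTrip {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for demonstration purposes only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("grouped-rdd"))

    val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Streaming"))

    // Compact form: one (element, count) pair per distinct element.
    val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
    println(groupedRdd.collect().toList) // e.g. List((Spark,3), (Streaming,1))

    // Back to the original elements (order is not preserved).
    val expanded = groupedRdd.flatMap(x => List.fill(x._2)(x._1))
    println(expanded.collect().toList)

    sc.stop()
  }
}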