Sorry, maybe I am saying something completely wrong... We have a stream, and
we digitize it to create an RDD; the RDD in this case will be just an array
of Any. Then we apply a transformation to create a new grouped RDD, and GC
should remove the original RDD from memory (if we don't persist it). Will we
have a GC step in val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)? My
suggestion is to remove the creation and reclaiming of the unneeded RDD and
create the already-grouped one directly.
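
(A minimal sketch, in plain Scala collections with no SparkContext, of what
rdd.map((_, 1)).reduceByKey(_ + _) computes; the words value is purely
illustrative. Note that in Spark the map step is lazy: the intermediate
(word, 1) RDD is not materialized in memory unless cache()/persist() is
called on it, and reduceByKey combines values map-side before the shuffle,
so there is no retained un-grouped copy for GC to reclaim in the first
place.)

  // Simulate rdd.map((_, 1)).reduceByKey(_ + _) with local collections.
  val words = Seq("Spark", "Spark", "Spark", "Flink")
  val grouped = words
    .map((_, 1))                                   // pair each word with 1
    .groupBy(_._1)                                 // group pairs by the word
    .map { case (k, vs) => (k, vs.map(_._2).sum) } // sum the counts per word
  // grouped == Map("Spark" -> 3, "Flink" -> 1)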

2015-07-19 21:26 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:

> The user gets to choose what they want to reside in memory.  If they call
> rdd.cache() on the original RDD, it will be in memory.  If they call
> rdd.cache() on the compact RDD, it will be in memory.  If cache() is called
> on both, they'll both be in memory.
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман <sergliho...@gmail.com>
> wrote:
>
>> Thanks for answer! Could you please answer for one more question? Will
>> we have in memory original rdd and grouped rdd in the same time?
>>
>> 2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>>
>>> Edit: the first line should read:
>>>
>>>   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>>
>>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
>>> wrote:
>>>
>>>> This functionality already basically exists in Spark.  To create the
>>>> "grouped RDD", one can run:
>>>>
>>>>   val groupedRdd = rdd.reduceByKey(_ + _)
>>>>
>>>> To get it back into the original form:
>>>>
>>>>>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>>
>>>> -Sandy
>>>>
>>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am looking for a suitable topic for a Master's Degree project
>>>>> (something like scalability problems and improvements for Spark
>>>>> Streaming), and it seems like the introduction of a grouped RDD (for
>>>>> example: don't store "Spark", "Spark", "Spark"; instead store
>>>>> ("Spark", 3)) can:
>>>>>
>>>>> 1. Reduce the memory needed for the RDD (roughly, memory used will be
>>>>> proportional to the % of unique messages)
>>>>> 2. Improve performance (no need to apply a function several times to
>>>>> the same message).
>>>>>
>>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>>> make sense? I would also greatly appreciate criticism and ideas.
>>>>>
>>>>
>>>>
>>>
>>
>
