The user gets to choose what resides in memory. If they call cache() on the original RDD, it will be kept in memory; if they call cache() on the compact RDD, that one will be kept in memory; and if cache() is called on both, both will be in memory.
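To make that concrete, here is a minimal sketch (assuming an existing SparkContext `sc`; the sample data and variable names are illustrative):

val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Streaming"))
val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

rdd.cache()        // the original RDD is kept in memory after its first use
groupedRdd.cache() // the compact (element, count) RDD is cached independently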
-Sandy

On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:

> Thanks for the answer! Could you please answer one more question? Will we
> have the original RDD and the grouped RDD in memory at the same time?
>
> 2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>
>> Edit: the first line should read:
>>
>> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>
>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>
>>> This functionality already basically exists in Spark. To create the
>>> "grouped RDD", one can run:
>>>
>>> val groupedRdd = rdd.reduceByKey(_ + _)
>>>
>>> To get it back into the original form:
>>>
>>> groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>
>>> -Sandy
>>>
>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for a suitable topic for a Master's degree project
>>>> (something like scalability problems and improvements for Spark
>>>> Streaming), and it seems like introducing a grouped RDD (for example:
>>>> don't store "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:
>>>>
>>>> 1. Reduce the memory needed for the RDD (roughly, memory used will be
>>>> proportional to the share of unique messages).
>>>> 2. Improve performance (no need to apply a function several times to
>>>> the same message).
>>>>
>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>> make sense? I would also very much appreciate criticism and ideas.
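For completeness, here is a self-contained sketch of the round trip described in the thread above (assuming spark-core on the classpath and a local master; the object name and sample data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object GroupedRddRoundTrip {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for demonstration purposes only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("grouped-rdd"))

    val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Streaming"))

    // Compact form: one (element, count) pair per distinct element.
    val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
    println(groupedRdd.collect().toList) // e.g. List((Spark,3), (Streaming,1))

    // Back to the original elements (order is not preserved).
    val expanded = groupedRdd.flatMap(x => List.fill(x._2)(x._1))
    println(expanded.collect().toList)

    sc.stop()
  }
}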