Hi Cheng,

Is it possible to delete or replicate an RDD?

> rdd1 = textFile("hdfs...").cache()
>
> rdd2 = rdd1.filter(userDefinedFunc1).cache()
> rdd3 = rdd1.filter(userDefinedFunc2).cache()

Let me reframe the question above: suppose rdd1 is around 50 GB and the
filtered results come to around 4 GB. To improve computing performance we
cache rdd1, while rdd2 and rdd3 remain on disk. Would this perform better
than leaving rdd1 on disk, running the filters against it, and caching
rdd2 and rdd3 instead?
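
For concreteness, the second option would look something like the minimal
PySpark sketch below (the HDFS path and the predicate functions are
placeholders standing in for the real userDefinedFunc1/userDefinedFunc2):

from pyspark import SparkContext

sc = SparkContext(appName="caching-sketch")

# Placeholder predicates standing in for userDefinedFunc1/2.
def user_defined_func1(line):
    return "error" in line

def user_defined_func2(line):
    return "warn" in line

rdd1 = sc.textFile("hdfs:///data/input")        # ~50 GB, left on disk
rdd2 = rdd1.filter(user_defined_func1).cache()  # ~4 GB, kept in memory
rdd3 = rdd1.filter(user_defined_func2).cache()

rdd2.count()  # first action materializes rdd2 and populates the cache
rdd3.count()  # rdd1 is re-read from HDFS here, since it was never cached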

Alternatively, can we remove a particular RDD from the cache, say rdd1 (if
cached), once the filter operations are done? It is not required after
that point, so evicting it would save memory.
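
If I understand the API correctly, RDD.unpersist() covers this case. Here
is a minimal sketch, reusing the placeholder names from the sketch above:

# Cache rdd1 only long enough to derive rdd2 and rdd3, then evict it.
rdd1 = sc.textFile("hdfs:///data/input").cache()

rdd2 = rdd1.filter(user_defined_func1).cache()
rdd3 = rdd1.filter(user_defined_func2).cache()

rdd2.count()  # rdd1 is read from HDFS once and cached here
rdd3.count()  # this action reads rdd1 from the cache, not from HDFS

rdd1.unpersist()  # drop rdd1's blocks; rdd2 and rdd3 stay cached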

Regards,
Arpit


On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Hi Joe,
>
> You need to consider which RDDs are used most frequently. In your case,
> rdd2 and rdd3 are filtered results of rdd1, so they are usually smaller
> than rdd1, and it would be more reasonable to cache rdd2 and/or rdd3 if
> rdd1 is not referenced elsewhere.
>
> Say rdd1 takes 10G and rdd2 takes 1G after filtering. If you cache both
> of them, you end up with 11G of memory consumption, which might not be
> what you want.
>
> Regards
> Cheng
>
>
> On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:
>
>> Hi, I am trying to cache 2 GB of data and implement the following
>> procedure. To cache it, I did as follows. Is it necessary to cache rdd2
>> since rdd1 is already cached?
>>
>> rdd1 = textFile("hdfs...").cache()
>>
>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
