Option B would be fine. As the answer on the SO question itself says, RDD transformations merely build DAG descriptions without executing anything, so in Option A, by the time you call unpersist, you still only have job descriptions and not a running execution.
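To see why that matters, here is a minimal plain-Scala sketch of the laziness (no Spark required; the `LazyDemo` object, its `loads` counter, and the `textFile`/`filterRdd`/`collect` names are all hypothetical stand-ins for the real APIs). Building the "lineage" reads nothing; only the action forces a load:

```scala
// A plain-Scala stand-in for Spark's lazy evaluation: transformations
// only build a recipe; nothing executes until an action forces it.
object LazyDemo {
  var loads = 0 // counts how many times the "file" is actually read

  // stand-in for sc.textFile: returns a recipe, reads nothing yet
  def textFile(): () => Seq[String] =
    () => { loads += 1; Seq("line1", "line2", "line3") }

  // stand-in for a transformation: wraps the parent recipe, still lazy
  def filterRdd(parent: () => Seq[String], p: String => Boolean): () => Seq[String] =
    () => parent().filter(p)

  // stand-in for an action: forces the whole lineage to run
  def collect(rdd: () => Seq[String]): Seq[String] = rdd()
}
```

After building both "RDDs", `loads` is still 0; only `collect` bumps it to 1. That is exactly the state Option A is in when unpersist runs: descriptions exist, but nothing has executed, so there is no populated cache to keep.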
Also note that in Option A you are not calling any action anywhere.

Thanks
Best Regards

On Tue, Apr 28, 2015 at 12:58 AM, Eran Medan <eran.me...@gmail.com> wrote:
> Hi Everyone!
>
> I'm trying to understand how Spark's caching works.
>
> Here is my naive understanding; please let me know if I'm missing
> something:
>
> val rdd1 = sc.textFile("some data")
> rdd1.cache() // marks rdd1 as cached
> val rdd2 = rdd1.filter(...)
> val rdd3 = rdd1.map(...)
> rdd2.saveAsTextFile("...")
> rdd3.saveAsTextFile("...")
>
> In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when
> rdd2 is saved, I assume), and then served from the cache (assuming there
> is enough RAM) when rdd3 is saved.
>
> Now here is my question. Let's say I want to cache rdd2 and rdd3, as they
> will both be used later on, but I don't need rdd1 after creating them.
>
> Basically there is duplication, isn't there? Once rdd2 and rdd3 are
> computed, I don't need rdd1 anymore, so I should probably unpersist it,
> right? The question is: when?
>
> *Will this work? (Option A)*
>
> val rdd1 = sc.textFile("some data")
> rdd1.cache() // marks rdd1 as cached
> val rdd2 = rdd1.filter(...)
> val rdd3 = rdd1.map(...)
> rdd2.cache()
> rdd3.cache()
> rdd1.unpersist()
>
> Does Spark add the unpersist call to the DAG, or is it done immediately?
> If it's done immediately, then rdd1 will effectively be uncached when I
> read from rdd2 and rdd3, right?
>
> *Should I do it this way instead (Option B)?*
>
> val rdd1 = sc.textFile("some data")
> rdd1.cache() // marks rdd1 as cached
> val rdd2 = rdd1.filter(...)
> val rdd3 = rdd1.map(...)
>
> rdd2.cache()
> rdd3.cache()
>
> rdd2.saveAsTextFile("...")
> rdd3.saveAsTextFile("...")
>
> rdd1.unpersist()
>
> *So the question is this:* Is Option A good enough? I.e., will rdd1
> still read the file only once? Or do I need to go with Option B?
>
> (See also
> http://stackoverflow.com/questions/29903675/understanding-sparks-caching)
>
> Thanks in advance
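To make the difference between the two options concrete, here is a hedged plain-Scala sketch (again no Spark; `FakeRdd`, `CacheDemo`, and the `loads` counter are hypothetical toys) where cache() memoizes a recipe and unpersist() drops the memo eagerly, mirroring the accepted behavior. Unpersisting before any action (Option A) means rdd1's file is read once per dependent action; unpersisting after the actions (Option B) means it is read only once:

```scala
// Sketch of cache()/unpersist() timing with a memoized recipe.
// cache() arms memoization; unpersist() drops the memo immediately.
class FakeRdd[A](compute: () => A) {
  private var cached: Option[A] = None
  private var caching = false
  def cache(): this.type = { caching = true; this }
  def unpersist(): Unit = { caching = false; cached = None } // eager drop
  def get(): A = cached.getOrElse {
    val v = compute()
    if (caching) cached = Some(v)
    v
  }
}

object CacheDemo {
  var loads = 0 // how many times the "file" was read

  def build(): (FakeRdd[Seq[String]], FakeRdd[Seq[String]], FakeRdd[Seq[String]]) = {
    loads = 0
    val rdd1 = new FakeRdd(() => { loads += 1; Seq("a", "b", "c") }).cache()
    val rdd2 = new FakeRdd(() => rdd1.get().filter(_ != "b"))
    val rdd3 = new FakeRdd(() => rdd1.get().map(_.toUpperCase))
    (rdd1, rdd2, rdd3)
  }

  // Option A: unpersist before any action -> each action recomputes rdd1
  def optionA(): Int = {
    val (rdd1, rdd2, rdd3) = build()
    rdd1.unpersist()         // memo dropped before anything ever ran
    rdd2.get(); rdd3.get()   // "actions": each forces rdd1 from scratch
    loads                    // 2 reads
  }

  // Option B: run the actions first, then unpersist -> a single read
  def optionB(): Int = {
    val (rdd1, rdd2, rdd3) = build()
    rdd2.get(); rdd3.get()   // first action loads and caches, second hits memo
    rdd1.unpersist()         // safe now: dependents are already materialized
    loads                    // 1 read
  }
}
```

In this toy, `optionA()` returns 2 and `optionB()` returns 1, which is the intuition behind preferring Option B: unpersist only after the actions that depend on rdd1 have actually run.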