Option B would be fine. As the answer on the SO question itself says, RDD transformations merely build DAG descriptions without executing anything, so in Option A, by the time you call unpersist, you still only have job descriptions and not a running execution.
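To see why that matters, here is a minimal plain-Scala sketch of the laziness (no Spark required; the `LazyDemo` object, its `loads` counter, and the `textFile`/`filterRdd`/`collect` names are all hypothetical stand-ins for the real APIs). Building the "lineage" reads nothing; only the action forces a load:

```scala
// A plain-Scala stand-in for Spark's lazy evaluation: transformations
// only build a recipe; nothing executes until an action forces it.
object LazyDemo {
  var loads = 0 // counts how many times the "file" is actually read

  // stand-in for sc.textFile: returns a recipe, reads nothing yet
  def textFile(): () => Seq[String] =
    () => { loads += 1; Seq("line1", "line2", "line3") }

  // stand-in for a transformation: wraps the parent recipe, still lazy
  def filterRdd(parent: () => Seq[String], p: String => Boolean): () => Seq[String] =
    () => parent().filter(p)

  // stand-in for an action: forces the whole lineage to run
  def collect(rdd: () => Seq[String]): Seq[String] = rdd()
}
```

After building both "RDDs", `loads` is still 0; only `collect` bumps it to 1. That is exactly the state Option A is in when unpersist runs: descriptions exist, but nothing has executed, so there is no populated cache to keep.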
Also note that in Option A you are not calling any action anywhere.

Thanks
Best Regards

On Tue, Apr 28, 2015 at 12:58 AM, Eran Medan <eran.me...@gmail.com> wrote:
> Hi Everyone!
>
> I'm trying to understand how Spark's caching works.
>
> Here is my naive understanding; please let me know if I'm missing
> something:
>
> val rdd1 = sc.textFile("some data")
> rdd1.cache() // marks rdd1 as cached
> val rdd2 = rdd1.filter(...)
> val rdd3 = rdd1.map(...)
> rdd2.saveAsTextFile("...")
> rdd3.saveAsTextFile("...")
>
> In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when
> rdd2 is saved, I assume), and then served from the cache (assuming there
> is enough RAM) when rdd3 is saved.
>
> Now here is my question. Let's say I want to cache rdd2 and rdd3, as they
> will both be used later on, but I don't need rdd1 after creating them.
>
> Basically there is duplication, isn't there? Once rdd2 and rdd3 are
> computed, I don't need rdd1 anymore, so I should probably unpersist it,
> right? The question is: when?
>
> *Will this work? (Option A)*
>
> val rdd1 = sc.textFile("some data")
> rdd1.cache() // marks rdd1 as cached
> val rdd2 = rdd1.filter(...)
> val rdd3 = rdd1.map(...)
> rdd2.cache()
> rdd3.cache()
> rdd1.unpersist()
>
> Does Spark add the unpersist call to the DAG, or is it done immediately?
> If it's done immediately, then rdd1 will effectively be uncached when I
> read from rdd2 and rdd3, right?
>
> *Should I do it this way instead (Option B)?*
>
> val rdd1 = sc.textFile("some data")
> rdd1.cache() // marks rdd1 as cached
> val rdd2 = rdd1.filter(...)
> val rdd3 = rdd1.map(...)
>
> rdd2.cache()
> rdd3.cache()
>
> rdd2.saveAsTextFile("...")
> rdd3.saveAsTextFile("...")
>
> rdd1.unpersist()
>
> *So the question is this:* Is Option A good enough? I.e., will rdd1
> still read the file only once? Or do I need to go with Option B?
>
> (See also
> http://stackoverflow.com/questions/29903675/understanding-sparks-caching)
>
> Thanks in advance
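To make the difference between the two options concrete, here is a hedged plain-Scala sketch (again no Spark; `FakeRdd`, `CacheDemo`, and the `loads` counter are hypothetical toys) where cache() memoizes a recipe and unpersist() drops the memo eagerly, mirroring the accepted behavior. Unpersisting before any action (Option A) means rdd1's file is read once per dependent action; unpersisting after the actions (Option B) means it is read only once:

```scala
// Sketch of cache()/unpersist() timing with a memoized recipe.
// cache() arms memoization; unpersist() drops the memo immediately.
class FakeRdd[A](compute: () => A) {
  private var cached: Option[A] = None
  private var caching = false
  def cache(): this.type = { caching = true; this }
  def unpersist(): Unit = { caching = false; cached = None } // eager drop
  def get(): A = cached.getOrElse {
    val v = compute()
    if (caching) cached = Some(v)
    v
  }
}

object CacheDemo {
  var loads = 0 // how many times the "file" was read

  def build(): (FakeRdd[Seq[String]], FakeRdd[Seq[String]], FakeRdd[Seq[String]]) = {
    loads = 0
    val rdd1 = new FakeRdd(() => { loads += 1; Seq("a", "b", "c") }).cache()
    val rdd2 = new FakeRdd(() => rdd1.get().filter(_ != "b"))
    val rdd3 = new FakeRdd(() => rdd1.get().map(_.toUpperCase))
    (rdd1, rdd2, rdd3)
  }

  // Option A: unpersist before any action -> each action recomputes rdd1
  def optionA(): Int = {
    val (rdd1, rdd2, rdd3) = build()
    rdd1.unpersist()         // memo dropped before anything ever ran
    rdd2.get(); rdd3.get()   // "actions": each forces rdd1 from scratch
    loads                    // 2 reads
  }

  // Option B: run the actions first, then unpersist -> a single read
  def optionB(): Int = {
    val (rdd1, rdd2, rdd3) = build()
    rdd2.get(); rdd3.get()   // first action loads and caches, second hits memo
    rdd1.unpersist()         // safe now: dependents are already materialized
    loads                    // 1 read
  }
}
```

In this toy, `optionA()` returns 2 and `optionB()` returns 1, which is the intuition behind preferring Option B: unpersist only after the actions that depend on rdd1 have actually run.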