Hi Everyone!
I'm trying to understand how Spark's caching works.
Here is my naive understanding, please let me know if I'm missing something:
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when
rdd2 is saved, I assume) and then read from the cache (assuming there is
enough RAM) when rdd3 is saved.
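To check this assumption, here is a minimal sketch that could be run locally. It assumes Spark 2.x or later (for sc.longAccumulator) and a local master. sc.parallelize stands in for sc.textFile, count() stands in for saveAsTextFile, and the names CacheOnce and countSourceEvaluations are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheOnce {
  // Counts how many source elements get computed across two actions on a cached RDD.
  def countSourceEvaluations(sc: SparkContext): Long = {
    val evaluations = sc.longAccumulator("source evaluations")

    // Stand-in for sc.textFile(...): every element bumps the counter when it is computed.
    val rdd1 = sc.parallelize(1 to 100, 4).map { x => evaluations.add(1L); x }
    rdd1.cache() // only marks rdd1 as cached; nothing is computed yet

    val rdd2 = rdd1.filter(_ % 2 == 0)
    val rdd3 = rdd1.map(_ * 2)

    rdd2.count() // first action: computes rdd1 and populates the cache
    rdd3.count() // second action: should be served from the cached rdd1

    evaluations.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-once").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(s"source elements computed: ${countSourceEvaluations(sc)}")
    finally sc.stop()
  }
}
```

If my understanding is right, the counter should end up at 100 rather than 200, provided the cached partitions fit in memory and no task is retried.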
Now here is my question. Let's say I want to cache rdd2 and rdd3, as they
will both be used later on, but I don't need rdd1 after creating them.
Basically there is duplication, isn't there? Once rdd2 and rdd3 are
calculated, I don't need rdd1 anymore, so I should probably unpersist it,
right? The question is when.
*Will this work? (Option A)*
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
Does Spark add the unpersist call to the DAG, or is it done immediately? If
it's done immediately, then basically rdd1 will no longer be cached when I
read from rdd2 and rdd3, right?
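One way I can think of to probe this is to look at the storage level right after each call, without running any action in between. The sketch below assumes a local master; the names UnpersistTiming and storageLevels are made up for illustration, and sc.parallelize stands in for sc.textFile.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistTiming {
  // Returns (level right after cache(), level right after unpersist()),
  // with no action executed at any point.
  def storageLevels(sc: SparkContext): (StorageLevel, StorageLevel) = {
    val rdd1 = sc.parallelize(1 to 100, 4)

    rdd1.cache()
    val afterCache = rdd1.getStorageLevel

    rdd1.unpersist()
    val afterUnpersist = rdd1.getStorageLevel

    (afterCache, afterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("unpersist-timing").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (afterCache, afterUnpersist) = storageLevels(sc)
      println(s"after cache():     $afterCache")
      println(s"after unpersist(): $afterUnpersist")
    } finally sc.stop()
  }
}
```

If the second value is already StorageLevel.NONE before any action has run, that would suggest unpersist takes effect at the call site instead of being scheduled as part of the DAG.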
*Should I do it this way instead (Option B)?*
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
*So the question is this:* Is Option A good enough? That is, will rdd1
still read the file only once? Or do I need to go with Option B?
(see also
http://stackoverflow.com/questions/29903675/understanding-sparks-caching)
Thanks in advance!