I understand that RDDs are not computed until an action is called. Is it
correct to conclude that calling .cache() anywhere in the program makes no
difference if I have only one action and it is called only once?

Related to this question, consider this situation: 
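// data is an RDD of 3-tuples; a pattern-matching function is needed to destructure them
val d1 = data.map { case (x, y, z) => (x, y) }
val d2 = data.map { case (x, y, z) => (y, x) }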

I'm wondering whether Spark optimizes the execution so that the map
functions for d1 and d2 run in parallel and the data RDD is traversed only
once.

If that is not the case, would it make a difference to cache the data RDD,
like this:
data.cache()  // marks data for caching; nothing is computed until an action runs
val d1 = data.map { case (x, y, z) => (x, y) }
val d2 = data.map { case (x, y, z) => (y, x) }
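
One way to check this empirically is to count how often the elements of data are actually computed, with and without the cache() call. The following is only a minimal sketch under assumptions that are not part of my real program: a local SparkSession, a tiny three-element stand-in for data, a hypothetical helper named countComputations, and one count() action per derived RDD.

```scala
import org.apache.spark.sql.SparkSession

object CacheCheck {
  // Hypothetical helper: returns how many times an element of `data` was computed
  // while running one action on d1 and one action on d2.
  def countComputations(useCache: Boolean): Long = {
    // Assumption: a local session; the original program's setup is not shown.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cache-check")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Incremented each time an element of `data` is produced.
      // Note: accumulator updates made inside transformations can be applied
      // more than once if a task is re-executed, so this is a diagnostic only.
      val computed = sc.longAccumulator("computed")

      // Stand-in for `data`: an RDD of 3-tuples.
      val data = sc.parallelize(Seq((1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0)))
        .map { t => computed.add(1L); t }
      if (useCache) data.cache()

      val d1 = data.map { case (x, y, _) => (x, y) }
      val d2 = data.map { case (x, y, _) => (y, x) }

      d1.count() // first action
      d2.count() // second action
      computed.sum
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"without cache: ${countComputations(useCache = false)}")
    println(s"with cache:    ${countComputations(useCache = true)}")
  }
}
```

Comparing the two printed numbers shows whether data was traversed once or once per action. The sketch uses two separate actions, so it does not cover the case of a single action whose lineage contains data twice.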

Furthermore, consider:
val d3 = d2.map { case (x, y) => (y, x) }

d1 and d3 are equivalent, since d3 swaps the fields of d2 back. Which implementation should be preferred?
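
To make the comparison concrete: d3 swaps the fields of d2 back, so it should contain the same pairs as d1, only reached through one extra map step. A possible way to look at this is to print the lineage of both RDDs with toDebugString. This is a sketch only; the local session, the two-element stand-in for data, and the object name LineageCompare are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LineageCompare {
  // Builds d1 and d3 as in the post, prints their lineage,
  // and returns the collected contents of both for comparison.
  def run(): (Seq[(Int, String)], Seq[(Int, String)]) = {
    // Assumption: a local session; the original program's setup is not shown.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lineage-compare")
      .getOrCreate()
    try {
      // Stand-in for `data`: an RDD of 3-tuples.
      val data = spark.sparkContext.parallelize(Seq((1, "a", 1.0), (2, "b", 2.0)))

      val d1 = data.map { case (x, y, _) => (x, y) }
      val d2 = data.map { case (x, y, _) => (y, x) }
      val d3 = d2.map { case (x, y) => (y, x) } // swaps d2's fields back

      // d3's lineage contains one more map step than d1's.
      println(d1.toDebugString)
      println(d3.toDebugString)

      (d1.collect().toSeq, d3.collect().toSeq)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (fromD1, fromD3) = run()
    println(s"same contents: ${fromD1 == fromD3}")
  }
}
```

Both maps are narrow transformations, so the extra step in d3 adds a function call per element rather than a shuffle.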

Thx.



