I understand that RDDs are not computed until an action is called. Is it correct to conclude that calling ".cache" anywhere in the program makes no difference if I have only one action and it is called only once?
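One way to check this empirically is a small experiment along the following lines. This is only a sketch under stated assumptions: it runs Spark in local mode, and the object name, the `--cache` flag, the `base`, `left` and `right` RDDs and the accumulator name are all made up for illustration. It counts how often the elements of a shared parent RDD are computed when a single action runs over a lineage that uses the parent twice.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical experiment: does cache() change anything with a single action?
object CacheSingleActionCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a hypothetical "--cache" flag toggles caching of the parent RDD.
    val useCache = args.contains("--cache")

    val spark = SparkSession.builder()
      .appName("cache-single-action-check")
      .master("local[2]") // assumption: local mode with two worker threads
      .getOrCreate()
    val sc = spark.sparkContext

    // Incremented once each time an element of `base` is actually computed.
    // Accumulator updates made inside transformations can be applied more
    // than once if a task is retried, so treat the value as an indication.
    val evaluations = sc.longAccumulator("evaluations")

    val base = sc.parallelize(1 to 1000, numSlices = 4).map { x =>
      evaluations.add(1)
      x
    }
    if (useCache) base.cache()

    // Two RDDs derived from the same parent, combined so that
    // exactly one action triggers the whole lineage.
    val left = base.map(_ + 1)
    val right = base.map(_ * 2)
    val rows = left.union(right).count()

    println(s"useCache=$useCache rows=$rows evaluations=${evaluations.value}")
    spark.stop()
  }
}
```

Running it once with and once without the flag and comparing the printed counter should give a hint as to whether the parent is recomputed for each derived RDD. The result may depend on task scheduling, so it is not a guarantee of what Spark does in general.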
Related to this question, consider this situation:

    val d1 = data.map { case (x, y, z) => (x, y) }
    val d2 = data.map { case (x, y, z) => (y, x) }

I'm wondering whether Spark optimizes the execution so that the mappers for d1 and d2 run in parallel and the data RDD is traversed only once. If that is not the case, would it make a difference to cache the data RDD, like this:

    data.cache()
    val d1 = data.map { case (x, y, z) => (x, y) }
    val d2 = data.map { case (x, y, z) => (y, x) }

Furthermore, consider:

    val d3 = d1.map { case (x, y) => (y, x) }

d2 and d3 are equivalent. Which implementation should be preferred?

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.