Hi there,

You should be selective about which RDDs you cache and which you don't. A good candidate for caching is an RDD that you reuse multiple times. The most common case is iterative machine learning algorithms, which take multiple passes over the same data.
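To make that concrete, here is a minimal sketch of the iterative pattern (the file path, `parseLine`, and the gradient formula are illustrative assumptions, not from this thread):

```scala
// Sketch: cache an RDD that an iterative job passes over repeatedly.
// `sc` is the SparkContext; `parseLine` is a hypothetical String => (Double, Double) parser.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(parseLine)
  .cache()   // materialized on the first action, served from memory afterwards

var weight = 0.0
for (i <- 1 to 100) {
  // Each reduce below is a separate action; without .cache() above,
  // every iteration would re-read and re-parse the input file.
  val gradient = points
    .map { case (x, y) => (weight * x - y) * x }
    .reduce(_ + _)
  weight -= 0.01 * gradient
}
```

Without the `.cache()`, the lineage from `textFile` onward would be recomputed 100 times.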
If you try to cache a really large RDD, Spark may evict older cached RDDs from memory to make room for the new one. That is another reason to be careful about which RDDs you cache.

>> Is it a correct conclusion that it doesn't matter if ".cache" is used anywhere in the program if I only have one action that is called only once?

Maybe. Sometimes a single action triggers a large DAG with 200 RDDs to materialize. Inside that DAG there may be ML algorithms, or an RDD that is reused multiple times for joins with other datasets. In those cases, even though you're calling just one action, it still makes sense to cache certain strategic RDDs. But with regards to your specific question...

>> Related to this question, consider this situation:
>> val d1 = data.map((x,y,z) => (x,y))
>> val d2 = data.map((x,y,z) => (y,x))
>> I'm wondering if Spark is optimizing the execution in a way that the mappers for d1 and d2 are running in parallel and the data RDD is traversed only once.

Here caching doesn't really help. Spark is smart enough to realize that both maps can be pipelined together in one thread/task. So if the 'data' RDD has 5 partitions, you only need 5 threads to apply both maps, not 10.

When you call an action, the DAG gets broken down into stages. Sometimes a prior stage has to finish completely before the next stage can run. Inside a stage there are multiple tasks, one per partition. A wide dependency between two RDDs usually defines the stage boundary; a wide dependency means a network shuffle has to take place between the two stages.

As Bojan said, you can call the toDebugString method on an RDD to see how the DAG that produces that RDD breaks down into different stages of execution.

On Thu, Apr 9, 2015 at 1:58 AM, Bojan Kostic <blood9ra...@gmail.com> wrote:
> You can use toDebugString to see all the steps in job.
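A small sketch of what that looks like in practice (the example RDDs are my own, not from the thread): narrow transformations like map stay in one stage, a wide one like reduceByKey starts a new stage, and the indentation in the toDebugString output marks the boundary.

```scala
// Assumes a SparkContext `sc` is already in scope (e.g. in spark-shell).
val data = sc.parallelize(1 to 100, 5)      // 5 partitions => 5 tasks per stage
val d1   = data.map(x => (x % 10, x))       // narrow dependency: pipelined in the same stage
val sums = d1.reduceByKey(_ + _)            // wide dependency: shuffle, so a new stage

// toDebugString is a parameterless method in Scala (no parentheses needed);
// each level of indentation in its output is a stage boundary.
println(sums.toDebugString)
```

Running the same action twice on `sums` would recompute the whole lineage each time unless one of the intermediate RDDs is cached.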
>
> Best
> Bojan
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.