Hi there,

You should be selective about which RDDs you cache and which you don't. A good candidate for caching is an RDD that you reuse multiple times. The most common case is iterative machine learning algorithms, which take multiple passes over the same data.
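To make that concrete, here is a minimal sketch of the iterative pattern (the file path, `parseLine`, and the gradient formula are illustrative assumptions, not from this thread):

```scala
// Sketch: cache an RDD that an iterative job passes over repeatedly.
// `sc` is the SparkContext; `parseLine` is a hypothetical String => (Double, Double) parser.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(parseLine)
  .cache()   // materialized on the first action, served from memory afterwards

var weight = 0.0
for (i <- 1 to 100) {
  // Each reduce below is a separate action; without .cache() above,
  // every iteration would re-read and re-parse the input file.
  val gradient = points
    .map { case (x, y) => (weight * x - y) * x }
    .reduce(_ + _)
  weight -= 0.01 * gradient
}
```

Without the `.cache()`, the lineage from `textFile` onward would be recomputed 100 times.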
If you try to cache a really large RDD, Spark may evict older cached RDDs from memory to make room for the new one. That is another reason to be careful about which RDDs you cache.

>> Is it a correct conclusion that it doesn't matter if ".cache" is used anywhere in the program if I only have one action that is called only once?

Maybe. Sometimes a single action triggers a large DAG with 200 RDDs to materialize. Inside that DAG there may be ML algorithms, or an RDD that is reused multiple times for joins with other datasets. In those cases, even though you're calling just one action, it still makes sense to cache certain strategic RDDs. But with regards to your specific question...

>> Related to this question, consider this situation:
>> val d1 = data.map((x,y,z) => (x,y))
>> val d2 = data.map((x,y,z) => (y,x))
>> I'm wondering if Spark is optimizing the execution in a way that the mappers for d1 and d2 are running in parallel and the data RDD is traversed only once.

Here caching doesn't really help. Spark is smart enough to realize that both maps can be pipelined together in one thread/task. So if the 'data' RDD has 5 partitions, you only need 5 threads to apply both maps, not 10.

When you call an action, the DAG gets broken down into stages. Sometimes a prior stage has to finish completely before the next stage can run. Inside a stage there are multiple tasks, one per partition. A wide dependency between two RDDs usually defines the stage boundary; a wide dependency means a network shuffle has to take place between the two stages.

As Bojan said, you can call the toDebugString method on an RDD to see how the DAG that produces that RDD breaks down into different stages of execution.

On Thu, Apr 9, 2015 at 1:58 AM, Bojan Kostic <blood9ra...@gmail.com> wrote:
> You can use toDebugString to see all the steps in job.
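A small sketch of what that looks like in practice (the example RDDs are my own, not from the thread): narrow transformations like map stay in one stage, a wide one like reduceByKey starts a new stage, and the indentation in the toDebugString output marks the boundary.

```scala
// Assumes a SparkContext `sc` is already in scope (e.g. in spark-shell).
val data = sc.parallelize(1 to 100, 5)      // 5 partitions => 5 tasks per stage
val d1   = data.map(x => (x % 10, x))       // narrow dependency: pipelined in the same stage
val sums = d1.reduceByKey(_ + _)            // wide dependency: shuffle, so a new stage

// toDebugString is a parameterless method in Scala (no parentheses needed);
// each level of indentation in its output is a stage boundary.
println(sums.toDebugString)
```

Running the same action twice on `sums` would recompute the whole lineage each time unless one of the intermediate RDDs is cached.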
>
> Best
> Bojan
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.