Zhan, I think it might be helpful to point out that I'm trying to run the RDDs in different threads to maximize the amount of work that can be done concurrently. Unfortunately, right now if I have something like this:
val rdd1 = ......cache()
val rdd2 = rdd1.map().....()

future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoopFile(...) }

the final result is that rdd1 is calculated twice. My dataset is several hundred gigs and I cannot wait for things to be calculated multiple times. I've tried running the saveAsHadoopFile() operations in a single thread, but I'm finding too much time being spent with no tasks running when I know I could be better saturating the resources of the cluster.

I'm trying to do the scheduling myself now: determining that rdd2 depends on rdd1 and that rdd1 is a persisted RDD (storage level != NONE), so that I can run a no-op action on rdd1 to materialize its cache before I run rdd2. I would much rather the DAG scheduler figure this out so I don't need to think about all this.

On Thu, Feb 26, 2015 at 5:43 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> You don't need to know rdd dependencies to maximize concurrency.
> Internally the scheduler will construct the DAG and trigger the execution
> if there are no shuffle dependencies between the RDDs.
>
> Thanks.
>
> Zhan Zhang
>
> On Feb 26, 2015, at 1:28 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
> > Let's say I'm given 2 RDDs and told to store them in a sequence file,
> > and they have the following dependency:
> >
> > val rdd1 = sparkContext.sequenceFile().....cache()
> > val rdd2 = rdd1.map(....)....
> >
> > How would I tell programmatically, without being the one who built rdd1
> > and rdd2, whether or not rdd2 depends on rdd1?
> >
> > I'm working on a concurrency model for my application and I won't
> > necessarily know how the two rdds are constructed. What I will know is
> > whether or not rdd1 is cached, but I want to maximize concurrency and
> > run rdd1 and rdd2 together if rdd2 does not depend on rdd1.
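For reference, here's roughly what I've been hand-rolling on the driver side. It's an untested sketch: dependsOn, isPersisted, and saveConcurrently are names I made up for this example, and I use saveAsTextFile just to keep the signatures simple (my real code uses saveAsHadoopFile). The only Spark APIs it leans on are RDD.dependencies, RDD.getStorageLevel, and StorageLevel.NONE, which are all public.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Walk the lineage of `child` and report whether it transitively
// depends on `parent`. RDD graphs are acyclic, so this terminates.
def dependsOn(child: RDD[_], parent: RDD[_]): Boolean =
  (child eq parent) || child.dependencies.exists(d => dependsOn(d.rdd, parent))

// True if the RDD has been marked for persistence at any storage level.
def isPersisted(rdd: RDD[_]): Boolean =
  rdd.getStorageLevel != StorageLevel.NONE

// If rdd2 depends on a persisted rdd1, materialize rdd1's cache with a
// cheap no-op action first, so the two concurrent saves read from the
// cache instead of each recomputing rdd1's lineage.
def saveConcurrently(rdd1: RDD[String], rdd2: RDD[String],
                     path1: String, path2: String): Unit = {
  if (dependsOn(rdd2, rdd1) && isPersisted(rdd1))
    rdd1.foreachPartition(_ => ())  // no-op action; forces rdd1 into the cache

  val saves = Seq(
    Future { rdd1.saveAsTextFile(path1) },
    Future { rdd2.saveAsTextFile(path2) }
  )
  Await.result(Future.sequence(saves), Duration.Inf)
}

This is exactly the bookkeeping I'd rather not be doing myself: if the scheduler noticed that both jobs share a persisted ancestor and materialized it once, none of this would be necessary.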