Zhan, I think it might be helpful to point out that I'm trying to run the RDDs in different threads to maximize the amount of work that can be done concurrently. Unfortunately, right now if I have something like this:
val rdd1 = ......cache()
val rdd2 = rdd1.map().....()

future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoopFile(...) }

the final result is that rdd1 is calculated twice. My dataset is several hundred gigs and I cannot wait for things to be calculated multiple times. I've tried running the saveAsHadoopFile() operations in a single thread, but I'm finding too much time being spent with no tasks running when I know I could be better saturating the resources of the cluster.

I'm trying to do the scheduling myself now: determining that rdd2 depends on rdd1 and that rdd1 is a persisted RDD (storage level != NONE), so that I can run a no-op action on rdd1 to materialize its cache before I run rdd2. I would much rather the DAG scheduler figure this out so I don't need to think about all this.

On Thu, Feb 26, 2015 at 5:43 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> You don't need to know rdd dependencies to maximize concurrency.
> Internally the scheduler will construct the DAG and trigger the execution
> if there are no shuffle dependencies between the RDDs.
>
> Thanks.
>
> Zhan Zhang
>
> On Feb 26, 2015, at 1:28 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
> > Let's say I'm given 2 RDDs and told to store them in a sequence file,
> > and they have the following dependency:
> >
> > val rdd1 = sparkContext.sequenceFile().....cache()
> > val rdd2 = rdd1.map(....)....
> >
> > How would I tell programmatically, without being the one who built rdd1
> > and rdd2, whether or not rdd2 depends on rdd1?
> >
> > I'm working on a concurrency model for my application and I won't
> > necessarily know how the two rdds are constructed. What I will know is
> > whether or not rdd1 is cached, but I want to maximize concurrency and
> > run rdd1 and rdd2 together if rdd2 does not depend on rdd1.
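For reference, here's roughly what I've been hand-rolling on the driver side. It's an untested sketch: dependsOn, isPersisted, and saveConcurrently are names I made up for this example, and I use saveAsTextFile just to keep the signatures simple (my real code uses saveAsHadoopFile). The only Spark APIs it leans on are RDD.dependencies, RDD.getStorageLevel, and StorageLevel.NONE, which are all public.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Walk the lineage of `child` and report whether it transitively
// depends on `parent`. RDD graphs are acyclic, so this terminates.
def dependsOn(child: RDD[_], parent: RDD[_]): Boolean =
  (child eq parent) || child.dependencies.exists(d => dependsOn(d.rdd, parent))

// True if the RDD has been marked for persistence at any storage level.
def isPersisted(rdd: RDD[_]): Boolean =
  rdd.getStorageLevel != StorageLevel.NONE

// If rdd2 depends on a persisted rdd1, materialize rdd1's cache with a
// cheap no-op action first, so the two concurrent saves read from the
// cache instead of each recomputing rdd1's lineage.
def saveConcurrently(rdd1: RDD[String], rdd2: RDD[String],
                     path1: String, path2: String): Unit = {
  if (dependsOn(rdd2, rdd1) && isPersisted(rdd1))
    rdd1.foreachPartition(_ => ())  // no-op action; forces rdd1 into the cache

  val saves = Seq(
    Future { rdd1.saveAsTextFile(path1) },
    Future { rdd2.saveAsTextFile(path2) }
  )
  Await.result(Future.sequence(saves), Duration.Inf)
}

This is exactly the bookkeeping I'd rather not be doing myself: if the scheduler noticed that both jobs share a persisted ancestor and materialized it once, none of this would be necessary.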