To distill this a bit further, I don't think you actually want rdd2 to wait on rdd1 in this case. What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD. If rdd2 waited on all of rdd1 instead, the first partition of rdd2 would wait on the final partition of rdd1 even when everything it needs is already available.
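
To make the per-partition idea concrete, here is a minimal sketch in plain Python rather than Spark code. PartitionCache, get_or_compute and demo are made-up names for illustration, not Spark APIs, and I am not claiming Spark's block manager works this way. The first request for a partition computes it; concurrent requests for the same partition block until that result is ready, while requests for other partitions proceed independently.

```python
import threading
import time
from concurrent.futures import Future


class PartitionCache:
    """Toy stand-in for a persisted RDD's cache (hypothetical, not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # assumed: function from partition index to data
        self._lock = threading.Lock()
        self._futures = {}           # partition index -> Future holding its result

    def get_or_compute(self, partition):
        # Decide under the lock who owns the computation for this partition.
        with self._lock:
            future = self._futures.get(partition)
            owner = future is None
            if owner:
                future = Future()
                self._futures[partition] = future
        if owner:
            # Compute outside the lock so other partitions are not blocked.
            try:
                future.set_result(self._compute(partition))
            except BaseException as exc:
                # In this sketch a failure is cached and re-raised to every waiter.
                future.set_exception(exc)
        # The owner returns immediately; everyone else waits only on this partition.
        return future.result()


def demo():
    calls = []

    def slow_compute(partition):
        calls.append(partition)      # records each real computation
        time.sleep(0.05)             # pretend the partition is expensive
        return partition * 10

    cache = PartitionCache(slow_compute)
    results = []
    threads = [
        threading.Thread(target=lambda p=p: results.append(cache.get_or_compute(p)))
        for p in (0, 0, 1, 0, 1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(calls), sorted(results)
```

Calling demo() issues five concurrent requests over two partitions, and each partition is computed only once. Nothing here looks at which RDD depends on which; the waiting happens purely at the level of "is this partition already in flight".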

Waiting per partition like that is probably a good idea in almost all cases. I don't know how hard it is to implement, but I speculate that it's easier to deal with at that level than as a function of the dependency graph.

On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet <cjno...@gmail.com> wrote:
> I'm trying to do the scheduling myself now- to determine that rdd2 depends
> on rdd1 and rdd1 is a persistent RDD (storage level != None) so that I can
> do the no-op on rdd1 before I run rdd2. I would much rather the DAG figure
> this out so I don't need to think about all this.