To distill this a bit further, I don't think you actually want rdd2 to
wait on rdd1 in this case. What you want is for a request for
partition X to wait if partition X is already being calculated in a
persisted RDD. Otherwise the first partition of rdd2 would wait on the
final partition of rdd1 even when the rest of rdd1 is ready.

That seems like a good idea in almost all cases. I don't know how hard
it is to implement, but I'd speculate that it's easier to handle at
that level than as a function of the dependency graph.
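
To make the idea concrete, here is a rough sketch in plain Python. It
is not Spark code and says nothing about how Spark's block manager
actually works. PartitionCache, slow_compute and demo are names I made
up for illustration. The point is only that the waiting happens per
partition: the first request for partition X computes it, and any
concurrent request for the same X blocks on that one computation
instead of starting its own.

```python
import threading
import time
from concurrent.futures import Future


class PartitionCache:
    """Hypothetical stand-in for a persisted RDD's partition store."""

    def __init__(self, compute):
        self._compute = compute      # function: partition index -> data
        self._lock = threading.Lock()
        self._futures = {}           # partition index -> Future

    def get(self, index):
        # Decide under the lock whether this caller owns the computation.
        with self._lock:
            fut = self._futures.get(index)
            owner = fut is None
            if owner:
                fut = Future()
                self._futures[index] = fut
        if owner:
            # Compute outside the lock so other partitions are not blocked.
            try:
                fut.set_result(self._compute(index))
            except Exception as exc:
                # In this sketch a failure stays cached; a real system
                # would want to allow a retry.
                fut.set_exception(exc)
        # Non-owners block here only until this one partition is ready.
        return fut.result()


def demo():
    calls = []
    calls_lock = threading.Lock()

    def slow_compute(index):
        with calls_lock:
            calls.append(index)
        time.sleep(0.05)             # pretend the partition is expensive
        return [index * 10]

    cache = PartitionCache(slow_compute)
    results = {}

    def job(name):
        # Both jobs walk the same partitions at the same time.
        results[name] = {i: cache.get(i) for i in range(4)}

    threads = [threading.Thread(target=job, args=(name,))
               for name in ("rdd1_noop", "rdd2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(calls), results
```

Running demo() starts two threads that play the roles of the no-op on
rdd1 and the job on rdd2. Each partition should be computed once, and
neither thread waits on a partition it has not asked for yet.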

On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet <cjno...@gmail.com> wrote:
> I'm trying to do the scheduling myself now- to determine that rdd2 depends
> on rdd1 and rdd1 is a persistent RDD (storage level != None) so that I can
> do the no-op on rdd1 before I run rdd2. I would much rather the DAG figure
> this out so I don't need to think about all this.

