The issue is that both RDDs are being evaluated at once. rdd1 is
cached, which means that as its partitions are evaluated, they are
persisted. Later requests for a partition hit the cached copy.
But here two threads start two jobs that evaluate partitions of
rdd1 at the same time. If both reach for the same partition before
it has been persisted, both will evaluate it. It's not certain that
rdd1 is completely evaluated twice, but most of it probably will be.

If rdd1 were evaluated by itself first (i.e. no futures, just a serial
program), then rdd2 would read every partition from the cached rdd1.
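
As a rough sketch of that serial approach (not taken from the original
program): the RDD contents, the app name, and the local master below are
made up for illustration, and it assumes rdd1's partitions fit in storage
memory so they are not evicted. The count() call plays the role of the
"no-op" action that forces rdd1 to be evaluated and cached before any
concurrent jobs start.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SerialCacheFirst {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for illustration only.
    val conf = new SparkConf().setAppName("serial-cache-first").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the expensive, persisted RDD.
      val rdd1 = sc.parallelize(1 to 1000000, 8).map(x => x.toLong * x).cache()

      // Run one job on rdd1 by itself, so every partition is computed and
      // stored before any other job asks for it.
      rdd1.count()

      // Hypothetical downstream RDD; its jobs now read rdd1's cached partitions.
      val rdd2 = rdd1.filter(_ % 2 == 0)

      // Concurrent jobs are started only after rdd1 is materialized, so
      // neither of them needs to recompute it.
      val sumJob = Future(rdd1.reduce(_ + _))
      val countJob = Future(rdd2.count())

      println(Await.result(sumJob, Duration.Inf))
      println(Await.result(countJob, Duration.Inf))
    } finally {
      sc.stop()
    }
  }
}
```

The cost is that nothing downstream starts until all of rdd1 is done,
which is exactly the coarse-grained wait discussed further down the
thread.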

On Thu, Feb 26, 2015 at 11:10 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
> What confused me is the statement of "The final result is that rdd1 is
> calculated twice." Is it the expected behavior?
>
> Thanks.
>
> Zhan Zhang
>
> On Feb 26, 2015, at 3:03 PM, Sean Owen <so...@cloudera.com> wrote:
>
> To distill this a bit further, I don't think you actually want rdd2 to
> wait on rdd1 in this case. What you want is for a request for
> partition X to wait if partition X is already being calculated in a
> persisted RDD. Otherwise the first partition of rdd2 waits on the
> final partition of rdd1 even when the rest is ready.
>
> That is probably a good idea in almost all cases. I don't know how
> hard even that much is to implement. But I speculate that it's
> easier to deal with it at that level than as a function of the
> dependency graph.
>
> On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
> I'm trying to do the scheduling myself now: to determine that rdd2 depends
> on rdd1 and that rdd1 is a persistent RDD (storage level != None), so that I
> can do the no-op on rdd1 before I run rdd2. I would much rather the DAG
> figure this out so I don't need to think about all this.
>
>
