Hi,

On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng <[email protected]> wrote:

> I'm talking about RDD1 (not persisted or checkpointed) in this situation:
>
> ...(somewhere) -> RDD1 -> RDD2
>                     |       |
>                     V       V
>                    RDD3 -> RDD4 -> Action!
>
> In my experience, how often RDD1 gets recalculated is volatile: sometimes
> once, sometimes twice.


That should not happen if your access pattern to RDD2 and RDD3 is always
the same.

> A related problem might be in $SQLContext.jsonRDD(), since the source
> jsonRDD is used twice (once for schema inference, once for reading the data).
> It almost guarantees that the source jsonRDD is calculated twice. Has this
> problem been addressed so far?
>

That's exactly why schema inference is expensive. However, I am afraid that in
general you have to decide between "store" and "recompute" (cf.
http://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff). There is no way
to avoid recomputation on each access other than storing the value, I
guess.
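
To make the tradeoff concrete, here is a toy model of the diamond-shaped
lineage from your diagram. It is plain Python and not Spark's API: the
LazyNode class, its persist flag, and the build_diamond/run helpers are all
made up for illustration. It only counts how often the shared parent is
evaluated when both branches pull from it, with and without storing its
result.

```python
class LazyNode:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not an RDD)."""

    def __init__(self, name, fn, parents=(), persist=False):
        self.name = name
        self.fn = fn                # function of the parents' results
        self.parents = parents
        self.persist = persist      # "store" instead of "recompute"
        self.evaluations = 0        # how many times fn actually ran
        self._stored = None
        self._has_stored = False

    def compute(self):
        # Serve the stored value if this node was asked to keep one.
        if self.persist and self._has_stored:
            return self._stored
        inputs = [p.compute() for p in self.parents]
        self.evaluations += 1
        result = self.fn(*inputs)
        if self.persist:
            self._stored, self._has_stored = result, True
        return result


def build_diamond(persist_rdd1):
    # rdd1 feeds both rdd2 and rdd3, which are combined into rdd4.
    rdd1 = LazyNode("rdd1", lambda: [1, 2, 3], persist=persist_rdd1)
    rdd2 = LazyNode("rdd2", lambda xs: [x * 2 for x in xs], (rdd1,))
    rdd3 = LazyNode("rdd3", lambda xs: [x + 1 for x in xs], (rdd1,))
    rdd4 = LazyNode("rdd4", lambda a, b: [x + y for x, y in zip(a, b)],
                    (rdd2, rdd3))
    return rdd1, rdd4


def run(persist_rdd1):
    """Trigger the 'action' on rdd4; return its result and rdd1's eval count."""
    rdd1, rdd4 = build_diamond(persist_rdd1)
    result = rdd4.compute()
    return result, rdd1.evaluations


if __name__ == "__main__":
    print(run(persist_rdd1=False))  # rdd1 is evaluated once per branch
    print(run(persist_rdd1=True))   # rdd1 is evaluated once, then reused
```

In this sketch the result is the same either way; only the number of
evaluations of the shared parent changes, at the price of holding its value
in memory. In Spark terms, the "store" side corresponds to persisting or
checkpointing RDD1.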

Tobias
