Hi,

On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng <[email protected]> wrote:
> I'm talking about RDD1 (not persisted or checkpointed) in this situation:
>
> ...(somewhere) -> RDD1 -> RDD2
>                    |       |
>                    V       V
>                   RDD3 -> RDD4 -> Action!
>
> In my experience the chance that RDD1 gets recalculated is volatile:
> sometimes once, sometimes twice.

That should not happen if your access pattern to RDD2 and RDD3 is always
the same.

> A related problem might be in SQLContext.jsonRDD(), since the source
> jsonRDD is used twice (once for schema inference, once for reading the
> data). It almost guarantees that the source jsonRDD is calculated twice.
> Has this problem been addressed so far?

That's exactly why schema inference is expensive. However, I am afraid
that in general you have to decide between "store" and "recompute" (cf.
http://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff). There is no
way to avoid recomputation on each access other than storing the value,
I guess.

Tobias
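
P.S. To make the store-versus-recompute choice concrete, here is a toy
model of the diamond above in plain Python. This is not Spark code: the
class LazyNode and its persist() method are hypothetical stand-ins for an
RDD and for Spark's persist()/cache(), and the model ignores partitions,
eviction, and the scheduler. It only counts how often the RDD1 step runs
when one action pulls on RDD4.

```python
class LazyNode:
    """Toy stand-in for an RDD: a lazily evaluated value with parent nodes."""

    def __init__(self, name, fn, parents=()):
        self.name = name
        self.fn = fn              # combines the parents' values into this node's value
        self.parents = parents
        self.compute_count = 0    # how often fn has actually run
        self._persist = False
        self._stored = None
        self._has_stored = False

    def persist(self):
        """Mark the node so its value is stored after the first computation."""
        self._persist = True
        return self

    def get(self):
        # A stored value is returned without touching the lineage again.
        if self._has_stored:
            return self._stored
        # Otherwise walk the lineage: every parent is asked for its value.
        inputs = [p.get() for p in self.parents]
        self.compute_count += 1
        value = self.fn(*inputs)
        if self._persist:
            self._stored, self._has_stored = value, True
        return value


def build_diamond(persist_rdd1):
    """RDD1 feeds both RDD2 and RDD3, which are combined into RDD4."""
    rdd1 = LazyNode("RDD1", lambda: [1, 2, 3])
    if persist_rdd1:
        rdd1.persist()
    rdd2 = LazyNode("RDD2", lambda xs: [x * 2 for x in xs], (rdd1,))
    rdd3 = LazyNode("RDD3", lambda xs: [x + 1 for x in xs], (rdd1,))
    rdd4 = LazyNode("RDD4", lambda a, b: a + b, (rdd2, rdd3))
    return rdd1, rdd4


def run(persist_rdd1):
    rdd1, rdd4 = build_diamond(persist_rdd1)
    result = rdd4.get()           # the "action"
    return result, rdd1.compute_count


if __name__ == "__main__":
    for flag in (False, True):
        result, count = run(flag)
        print(f"persist={flag}: RDD1 computed {count} time(s), result={result}")
```

In this model the RDD1 step runs twice without persist() and once with
it, and the result is the same either way. The price of the second
variant is the memory held by the stored value, which is the trade-off
described above.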
