If you don't want to recalculate, you need to hold the results somewhere. If you need to save the data anyway, why not do that first, then read it back and compute your stats from the saved copy?
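A minimal sketch of that "save it, then read it again" approach, assuming a PySpark DataFrame `df` holding the result of the expensive transformations; the output path and the column names ("group", "value") are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("save-then-stats").getOrCreate()

    # df = <result of the expensive transformations>

    # Task A: materialize the transformed data once.
    df.write.mode("overwrite").parquet("/data/output_path")

    # Task B: read the saved data back (a cheap columnar scan) and compute
    # per-group statistics without re-running the original lineage.
    saved = spark.read.parquet("/data/output_path")
    stats = saved.groupBy("group").agg(
        F.count("*").alias("rows"),
        F.avg("value").alias("avg_value"),
    )
    stats.show()

Reading the parquet back avoids both re-executing the full lineage and holding the whole 1+ TB dataset in cache.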
On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <ferdonl...@gmail.com> wrote:

> Dear Spark users
>
> Is it possible to take the output of a transformation (RDD/DataFrame) and
> feed it to two independent transformations without recalculating the first
> transformation and without caching the whole dataset?
>
> Consider the case of a very large dataset (1+ TB) which has gone through
> several transformations, and now we want to save it but also calculate some
> statistics per group.
> So the best way to process it would be: for each partition, do task A, then
> do task B.
>
> I don't see a way of instructing Spark to proceed that way without caching
> to disk, which seems unnecessarily heavy. And if we don't cache, Spark
> recalculates every partition all the way from the beginning. In either case
> huge file reads happen.
>
> Any ideas on how to avoid this?
>
> Thanks
>
> Fernando