Hi,
I'm puzzling over the following problem: when I cache a small sample of a
big dataframe, the small dataframe is recomputed when selecting a column
(but not if show() or count() is invoked).
Why is that so and how can I avoid recomputation of the small sample
dataframe?
More details:
- I have a big dataframe "df" of ~190 million rows and ~10 columns, obtained
via 3 different joins; I cache it and invoke count() to make sure it really
is in memory, and I have confirmed this in the web UI
- val sdf = df.sample(false, 1e-6); sdf.cache(); sdf.count() // 170 rows;
caching is also confirmed in the web UI, size in memory is 150 kB
*- sdf.select("colname").show() // this triggers a complete recomputation
of sdf with 3 joins!*
- show(), count() or take() do not trigger the recomputation of the 3
joins, but select(), collect() or withColumn() do.
I have --executor-memory 30G --driver-memory 10g, so memory is not a
problem. I'm using Spark 1.4.0. Could anybody shed some light on this, or
point me to where I can find more info?
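
For reference, here is a minimal sketch of the steps above as a standalone
program. The input tables a, b, c, d, the join key "id", the object name and
the local master setting are all hypothetical stand-ins (the real inputs and
join conditions are different), and the sample fraction is raised from 1e-6
so that the tiny stand-in data still yields a few rows. It has not been
verified against the real job; it only shows the sequence of calls and adds
explain(true) so the plans can be compared.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical repro of the steps described above (Spark 1.4-style API).
object CachedSampleRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cached-sample-repro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Assumption: small stand-in tables instead of the real ~190 million row inputs.
    val n = 100000
    val a = sc.parallelize(1 to n).map(i => (i, i % 7)).toDF("id", "colname")
    val b = sc.parallelize(1 to n).map(i => (i, i % 11)).toDF("id", "b")
    val c = sc.parallelize(1 to n).map(i => (i, i % 13)).toDF("id", "c")
    val d = sc.parallelize(1 to n).map(i => (i, i % 17)).toDF("id", "d")

    // "df": the big dataframe obtained via 3 joins, cached and materialized.
    val df = a.join(b, "id").join(c, "id").join(d, "id")
    df.cache()
    df.count()

    // "sdf": the small sample, cached and materialized the same way.
    // Assumption: fraction 1e-2 here instead of 1e-6, because the data is tiny.
    val sdf = df.sample(false, 1e-2)
    sdf.cache()
    sdf.count()

    // Print the logical and physical plans to see whether the cached data
    // of sdf is scanned or whether the joins show up again.
    sdf.explain(true)
    sdf.select("colname").explain(true)

    // The call that triggers the recomputation in the real job.
    sdf.select("colname").show()

    sc.stop()
  }
}
```

If the physical plan printed for sdf.select("colname") contains an in-memory
scan of the cached sample, the cache is being used; if it lists the three
joins again, the cached sample is being bypassed for that query.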
Many thanks,
Kristina