No, it's not true that one action means every DF is evaluated once. This is a good counterexample.
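For what it's worth, here is a minimal, self-contained sketch of the pattern described further down in the thread (the paths, column names, and output format are made up for illustration), showing why DF1 appears in DF3's plan twice and where cache() and explain() fit in:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("caching-sketch").getOrCreate()

    // df1 feeds both df2 and the final join, so its lineage shows up twice in df3's plan.
    val df1 = spark.read.option("header", "true").csv("/path/to/first.csv") // hypothetical path
    df1.cache() // without this, the single save action recomputes everything behind df1 twice

    val df2 = df1.groupBy("key").agg(sum("amount").as("total")) // hypothetical columns

    val df3 = spark.read.option("header", "true").csv("/path/to/second.csv") // hypothetical path
      .join(df1, "key")
      .join(df2, "key")

    // The physical plan contains the df1 branch twice; once df1 is cached you should
    // see InMemoryTableScan there instead of a second file scan.
    df3.explain(true)

    df3.write.mode("overwrite").parquet("/path/to/output") // the one and only action

    spark.stop()
  }
}

Caching is about how many times a piece of lineage gets recomputed, not about how many actions you call, so a single action can still justify a cache when the same DataFrame sits under several branches of the plan.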
On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma <resolve...@gmail.com> wrote:

> Thanks for the information. I am using Spark 2.3.3. A few more questions:
>
> 1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, is DF1 evaluated just once, or does it depend on how many times the dataframe is used in transformations?
>
> I believe that even if we use a dataframe multiple times in transformations, the decision to cache should be based on actions. In my case the only action is one save call on DF3. Please correct me if I am wrong.
>
> Thanks
> Amit
>
> On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <
> theo.gkountou...@futurewei.com> wrote:
>
>> Hi Amit,
>>
>> One action might use the same DataFrame more than once. You can look at
>> your logical plan by executing DF3.explain (the arguments differ depending
>> on the version of Spark you are using) and see how many times DF2 or DF1
>> needs to be computed. Given the information you have provided, I suspect
>> that DF1 is used more than once (once in DF2 and again in DF3). So, if you
>> cache it, Spark computes it the first time and loads it from cache instead
>> of running it again the second time.
>>
>> I hope this helped,
>> Theo.
>>
>> *From:* Amit Sharma <resolve...@gmail.com>
>> *Sent:* Monday, December 7, 2020 11:32 AM
>> *To:* user@spark.apache.org
>> *Subject:* Caching
>>
>> Hi All, I am using caching in my code. I have DFs like:
>>
>> val DF1 = read csv
>> val DF2 = DF1.groupBy().agg().select(.....)
>> val DF3 = read csv .join(DF1).join(DF2)
>> DF3.save
>>
>> If I do not cache DF1 or DF2 it takes longer. But I am doing only one
>> action, so why do I need to cache?
>>
>> Thanks
>> Amit