Hi Amit,

A single action can still use the same DataFrame more than once. You can inspect 
the logical plan by calling DF3.explain (its arguments differ depending on the 
version of Spark you are using) and see how many times DF2 or DF1 has to be 
computed. Given the information you have provided, I suspect that DF1 is used 
more than once (once in DF2 and again in DF3). Spark does not reuse those 
intermediate results on its own: without caching, DF1 is recomputed for each 
use. If you cache it, Spark materializes it the first time it is computed and 
reads it from the cache instead of running it again the second time.
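To make this concrete, here is a minimal sketch of what caching DF1 would look like. The column names, paths, and aggregation are made-up placeholders, not your actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("cache-example").getOrCreate()

// Hypothetical input paths and column names; adjust to your data.
val DF1 = spark.read.csv("/path/to/first.csv").cache() // mark DF1 for caching

val DF2 = DF1.groupBy("key").agg(sum("value").as("total"))

// DF1 appears twice in DF3's plan; with cache() it is computed once,
// materialized on first use, and served from memory afterwards.
val DF3 = spark.read.csv("/path/to/second.csv")
  .join(DF1, Seq("key"))
  .join(DF2, Seq("key"))

DF3.explain() // a cached DF1 shows up in the plan as an InMemoryRelation
DF3.write.csv("/path/to/output")
```

Without the `.cache()` call, the explain output would show the full scan and lineage of DF1 repeated under both joins.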

I hope this helped,
Theo.

From: Amit Sharma <resolve...@gmail.com>
Sent: Monday, December 7, 2020 11:32 AM
To: user@spark.apache.org
Subject: Caching

Hi All, I am using caching in my code. I have DataFrames like:

val DF1 = read csv
val DF2 = DF1.groupBy().agg().select(.....)

val DF3 = read csv.join(DF1).join(DF2)
DF3.save

If I do not cache DF1 or DF2, it takes a longer time. But I am running only one 
action, so why do I need to cache?

Thanks
Amit
