> Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query.
No. That would mean that Spark would need to cache DF1. Spark won't cache dataframes unless you ask it to, even if it knows that the same dataframe is being used twice. This is because caching dataframes introduces memory overhead, and Spark is not going to do it prematurely. It will combine the processing of various dataframes within a stage; however, in your case you are doing an aggregation, which will create a new stage. You can check the execution plan if you like.

From: Amit Sharma <resolve...@gmail.com>
Reply-To: "resolve...@gmail.com" <resolve...@gmail.com>
Date: Monday, December 7, 2020 at 1:47 PM
To: "Lalwani, Jayesh" <jlalw...@amazon.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching

Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query.

Thanks
Amit

On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, without caching Spark will read the CSV twice: once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, it reads from the CSV only once.

You might want to look at doing a windowed query on DF1 to avoid joining DF1 with DF2. This should give you better or similar performance compared to caching, because Spark will cache the data during the shuffle anyway.

From: Amit Sharma <resolve...@gmail.com>
Reply-To: "resolve...@gmail.com" <resolve...@gmail.com>
Date: Monday, December 7, 2020 at 12:47 PM
To: Theodoros Gkountouvas <theo.gkountou...@futurewei.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching

Thanks for the information. I am using Spark 2.3.3. A few more questions:

1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, should the action count for DF1 be just one, or does it depend on how many times the dataframe is used in transformations? I believe that even if we use a dataframe multiple times in transformations, the decision to cache should be based on actions. In my case the only action is the one save call on DF3. Please correct me if I am wrong.

Thanks
Amit

On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <theo.gkountou...@futurewei.com> wrote:

Hi Amit,

One action might use the same DataFrame more than once. You can look at your logical plan by executing DF3.explain (the arguments differ depending on the Spark version you are using) and see how many times you need to compute DF2 or DF1. Given the information you have provided, I suspect that DF1 is used more than once (once for DF2 and once more for DF3). So Spark is going to cache it the first time and load it from the cache instead of computing it again the second time.

I hope this helped,
Theo.
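A minimal sketch of the advice above (explicit cache plus checking the plan with explain), assuming a placeholder input path and placeholder column names "key" and "value" that are not from the original mails:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder()
      .appName("CacheCheck")
      .master("local[*]")
      .getOrCreate()

    // Placeholder input; the real path and schema are not given in the thread.
    val df1 = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv")
    df1.cache() // explicit: Spark will not cache a reused dataframe on its own

    val df2 = df1.groupBy("key").agg(sum("value").as("total"))
    val df3 = df1.join(df2, "key")

    df3.explain(true) // extended plan; available on Spark 2.3.3
    // Without the cache() call, the CSV FileScan typically appears twice in
    // the physical plan (once under each branch of the join); with it, both
    // branches read from an InMemoryTableScan instead.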
From: Amit Sharma <resolve...@gmail.com>
Sent: Monday, December 7, 2020 11:32 AM
To: user@spark.apache.org
Subject: Caching

Hi All,

I am using caching in my code. I have dataframes like:

    val DF1 = read csv
    val DF2 = DF1.groupBy().agg().select(.....)
    val DF3 = read csv.join(DF1).join(DF2)
    DF3.save

If I do not cache DF1 or DF2, it takes longer. But I am performing only one action, so why do I need to cache?

Thanks
Amit
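For reference, a sketch of the windowed-query alternative suggested upthread, which computes the per-group aggregate without a separate groupBy-and-join, so the CSV is scanned only once. The path and the column names "key" and "value" are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, sum}

    val spark = SparkSession.builder()
      .appName("WindowAlternative")
      .master("local[*]")
      .getOrCreate()

    val df1 = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv")

    // Equivalent in spirit to df1.groupBy("key").agg(sum("value")) joined
    // back onto df1, but expressed as a window aggregate over a single scan.
    val byKey = Window.partitionBy("key")
    val withTotal = df1.withColumn("total", sum(col("value")).over(byKey))

    withTotal.explain() // one FileScan and one shuffle in the plan

The window version still shuffles by key once, the same shuffle the aggregation needed anyway, which is why the thread expects similar or better performance than the cache-and-join approach.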