> Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query.
No. That would mean that Spark would need to cache DF1. Spark won't cache dataframes unless you ask it to, even if it knows that the same dataframe is being used twice. This is because caching dataframes introduces memory overhead, and Spark is not going to do it prematurely. It will combine the processing of various dataframes within a stage; however, in your case you are doing an aggregation, which will create a new stage. You can check the execution plan if you like.

From: Amit Sharma <resolve...@gmail.com>
Reply-To: "resolve...@gmail.com" <resolve...@gmail.com>
Date: Monday, December 7, 2020 at 1:47 PM
To: "Lalwani, Jayesh" <jlalw...@amazon.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching

Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query.

Thanks
Amit

On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, without caching Spark will read the CSV twice: once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, it reads from the CSV only once.

You might want to look at doing a windowed query on DF1 to avoid joining DF1 with DF2. This should give you better or similar performance compared to caching, because Spark will cache the data during the shuffle anyway.

From: Amit Sharma <resolve...@gmail.com>
Reply-To: "resolve...@gmail.com" <resolve...@gmail.com>
Date: Monday, December 7, 2020 at 12:47 PM
To: Theodoros Gkountouvas <theo.gkountou...@futurewei.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching

Thanks for the information. I am using Spark 2.3.3. A few more questions:

1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, should the action count for DF1 be just one, or does it depend on how many times the dataframe is used in transformations? I believe that even if we use a dataframe multiple times in transformations, the decision to cache should be based on actions. In my case the only action is the one save call on DF3. Please correct me if I am wrong.

Thanks
Amit

On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <theo.gkountou...@futurewei.com> wrote:

Hi Amit,

One action might use the same DataFrame more than once. You can look at your logical plan by executing DF3.explain (the arguments differ depending on the Spark version you are using) and see how many times you need to compute DF2 or DF1. Given the information you have provided, I suspect that DF1 is used more than once (once for DF2 and once more for DF3). So Spark is going to cache it the first time and load it from the cache instead of computing it again the second time.

I hope this helped,
Theo.
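A minimal sketch of the advice above (explicit cache plus checking the plan with explain), assuming a placeholder input path and placeholder column names "key" and "value" that are not from the original mails:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder()
      .appName("CacheCheck")
      .master("local[*]")
      .getOrCreate()

    // Placeholder input; the real path and schema are not given in the thread.
    val df1 = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv")
    df1.cache() // explicit: Spark will not cache a reused dataframe on its own

    val df2 = df1.groupBy("key").agg(sum("value").as("total"))
    val df3 = df1.join(df2, "key")

    df3.explain(true) // extended plan; available on Spark 2.3.3
    // Without the cache() call, the CSV FileScan typically appears twice in
    // the physical plan (once under each branch of the join); with it, both
    // branches read from an InMemoryTableScan instead.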
From: Amit Sharma <resolve...@gmail.com>
Sent: Monday, December 7, 2020 11:32 AM
To: user@spark.apache.org
Subject: Caching

Hi All,

I am using caching in my code. I have dataframes like:

    val DF1 = read csv
    val DF2 = DF1.groupBy().agg().select(.....)
    val DF3 = read csv.join(DF1).join(DF2)
    DF3.save

If I do not cache DF1 or DF2, it takes longer. But I am performing only one action, so why do I need to cache?

Thanks
Amit
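For reference, a sketch of the windowed-query alternative suggested upthread, which computes the per-group aggregate without a separate groupBy-and-join, so the CSV is scanned only once. The path and the column names "key" and "value" are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, sum}

    val spark = SparkSession.builder()
      .appName("WindowAlternative")
      .master("local[*]")
      .getOrCreate()

    val df1 = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv")

    // Equivalent in spirit to df1.groupBy("key").agg(sum("value")) joined
    // back onto df1, but expressed as a window aggregate over a single scan.
    val byKey = Window.partitionBy("key")
    val withTotal = df1.withColumn("total", sum(col("value")).over(byKey))

    withTotal.explain() // one FileScan and one shuffle in the plan

The window version still shuffles by key once, the same shuffle the aggregation needed anyway, which is why the thread expects similar or better performance than the cache-and-join approach.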