No, it's not true that one action means every DF is evaluated once. This is a good counterexample.
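For what it's worth, here is a minimal, self-contained sketch of the pattern described further down in the thread (the paths, column names, and output format are made up for illustration), showing why DF1 appears in DF3's plan twice and where cache() and explain() fit in:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("caching-sketch").getOrCreate()

    // df1 feeds both df2 and the final join, so its lineage shows up twice in df3's plan.
    val df1 = spark.read.option("header", "true").csv("/path/to/first.csv") // hypothetical path
    df1.cache() // without this, the single save action recomputes everything behind df1 twice

    val df2 = df1.groupBy("key").agg(sum("amount").as("total")) // hypothetical columns

    val df3 = spark.read.option("header", "true").csv("/path/to/second.csv") // hypothetical path
      .join(df1, "key")
      .join(df2, "key")

    // The physical plan contains the df1 branch twice; once df1 is cached you should
    // see InMemoryTableScan there instead of a second file scan.
    df3.explain(true)

    df3.write.mode("overwrite").parquet("/path/to/output") // the one and only action

    spark.stop()
  }
}

Caching is about how many times a piece of lineage gets recomputed, not about how many actions you call, so a single action can still justify a cache when the same DataFrame sits under several branches of the plan.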
On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma <resolve...@gmail.com> wrote:

> Thanks for the information. I am using Spark 2.3.3. A few more questions:
>
> 1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, is DF1 evaluated just once, or does it depend on how many times the dataframe is used in transformations?
>
> I believe that even if we use a dataframe multiple times in transformations, the decision to cache should be based on actions. In my case the only action is one save call on DF3. Please correct me if I am wrong.
>
> Thanks
> Amit
>
> On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <
> theo.gkountou...@futurewei.com> wrote:
>
>> Hi Amit,
>>
>> One action might use the same DataFrame more than once. You can look at
>> your logical plan by executing DF3.explain (the arguments differ depending
>> on the version of Spark you are using) and see how many times DF2 or DF1
>> needs to be computed. Given the information you have provided, I suspect
>> that DF1 is used more than once (once in DF2 and again in DF3). So, if you
>> cache it, Spark computes it the first time and loads it from cache instead
>> of running it again the second time.
>>
>> I hope this helped,
>> Theo.
>>
>> *From:* Amit Sharma <resolve...@gmail.com>
>> *Sent:* Monday, December 7, 2020 11:32 AM
>> *To:* user@spark.apache.org
>> *Subject:* Caching
>>
>> Hi All, I am using caching in my code. I have DFs like:
>>
>> val DF1 = read csv
>> val DF2 = DF1.groupBy().agg().select(.....)
>> val DF3 = read csv .join(DF1).join(DF2)
>> DF3.save
>>
>> If I do not cache DF1 or DF2 it takes longer. But I am doing only one
>> action, so why do I need to cache?
>>
>> Thanks
>> Amit