Hi Weichen,

Thank you very much for the explanation.
On Fri, Oct 13, 2017 at 6:56 PM, Weichen Xu <weichen...@databricks.com> wrote:

> Hi Supun,
>
> The Dataframe API is NOT using the old RDD implementation under the
> covers; Dataframe has its own implementation. (Dataframes use a binary
> row format, and columnar storage when cached.) So the Dataframe has no
> relationship to the `RDD[Row]` you want to get.
>
> When you call `df.rdd` and then cache, Spark needs to turn the Dataframe
> into an RDD: it extracts each row from the Dataframe, deserializes it,
> and composes a new RDD.
>
> Thanks!
>
> On Sat, Oct 14, 2017 at 6:17 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>> @Vadim Would it be true to say that `.rdd` *may* create a new job,
>> depending on whether the DataFrame/Dataset had already been
>> materialized via an action or checkpoint? If the only prior operations
>> on the DataFrame had been transformations, then the Dataframe would
>> still not have been computed. In that case, would it also be true that
>> a subsequent action/checkpoint on the DataFrame (not the rdd) would
>> then generate a separate job?
>>
>> 2017-10-13 14:50 GMT-07:00 Vadim Semenov <vadim.seme...@datadoghq.com>:
>>
>>> When you do `Dataset.rdd` you actually create a new job.
>>>
>>> Here you can see what it does internally:
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
>>>
>>> On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala
>>> <supun.nakand...@gmail.com> wrote:
>>>
>>>> Hi Weichen,
>>>>
>>>> Thank you for the reply.
>>>>
>>>> My understanding was that the Dataframe API uses the old RDD
>>>> implementation under the covers, though it presents a different API,
>>>> and that calling df.rdd simply gives access to the underlying RDD.
>>>> Is this assumption wrong? I would appreciate it if you could shed
>>>> more insight on this issue or point me to documentation where I can
>>>> learn more.
>>>>
>>>> Thank you in advance.
>>>>
>>>> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu
>>>> <weichen...@databricks.com> wrote:
>>>>
>>>>> You should use `df.cache()`.
>>>>> `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD
>>>>> from the original `df`, and it is that new RDD which gets cached.
>>>>>
>>>>> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala
>>>>> <supun.nakand...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have been experimenting with the cache/persist/unpersist methods
>>>>>> on both the Dataframe and RDD APIs. However, I am seeing different
>>>>>> behavior from the Dataframe API compared to the RDD API; for
>>>>>> example, Dataframes are not getting cached when count() is called.
>>>>>>
>>>>>> Is there a difference between how these operations act with
>>>>>> respect to the Dataframe and RDD APIs?
>>>>>>
>>>>>> Thank You.
>>>>>> -Supun
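
A minimal sketch of the difference Weichen describes, runnable in
spark-shell or as a standalone snippet. The local SparkSession setup, the
example data, and the names (`df`, `rowRdd`) are illustrative and not from
the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-sketch")
  .master("local[*]")  // local mode, just for illustration
  .getOrCreate()

val df = spark.range(0, 1000000).toDF("id")

// Caches the DataFrame itself (columnar storage once materialized).
// A later DataFrame action such as df.count() will hit this cache.
df.cache()
df.count()

// By contrast, df.rdd derives a *new* RDD[Row] from the DataFrame
// (deserializing every row), so caching it does not cache `df`:
val rowRdd = df.rdd
rowRdd.persist(StorageLevel.MEMORY_ONLY)
rowRdd.count()  // materializes and caches rowRdd, not df

In spark-shell the `spark` session already exists, so the builder lines
can be dropped.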
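To confirm which object actually got cached, `Dataset.storageLevel`
(available since Spark 2.1) and `RDD.getStorageLevel` report the effective
storage level; `df` and `rowRdd` refer to the sketch above:

println(df.storageLevel)         // MEMORY_AND_DISK after df.cache()
println(rowRdd.getStorageLevel)  // MEMORY_ONLY after rowRdd.persist(...)

The Storage tab of the Spark UI shows the same information, which makes it
easy to see that `df.cache()` and `df.rdd.cache()` create two separate
cache entries.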