Thanks for the explanation, Yin.

Justin
On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai <yh...@databricks.com> wrote:

> I think the slowness is caused by the way that we serialize/deserialize
> the value of a complex type. I have opened
> https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.
>
> On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip <yipjus...@prediction.io> wrote:
>
>> The schema has a StructType.
>>
>> Justin
>>
>> On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> Hi Justin,
>>>
>>> Does the schema of your data have any decimal, array, map, or struct
>>> type?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip <yipjus...@prediction.io> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a parquet file of around 55M rows (~1G on disk). Performing a
>>>> simple grouping operation is pretty efficient (I get results within 10
>>>> seconds). However, after calling DataFrame.cache, I observe significant
>>>> performance degradation: the same operation now takes 3+ minutes.
>>>>
>>>> My hunch is that the DataFrame cannot leverage its columnar format
>>>> after persisting in memory, but I cannot find this mentioned anywhere
>>>> in the docs.
>>>>
>>>> Did I miss anything?
>>>>
>>>> Thanks!
>>>>
>>>> Justin
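P.S. For anyone hitting this later, here is a minimal sketch of the pattern
discussed in this thread, using the Spark 1.3 DataFrame API. The file path
and the grouping column name are placeholders, not from the actual dataset;
substitute a parquet file whose schema contains a struct (or other complex)
column. It can be pasted into spark-shell (drop the SparkContext setup
lines there, since sc and sqlContext already exist):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("cache-slowdown-repro").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Load a parquet file whose schema includes a complex (struct) column.
// The path is a placeholder.
val df = sqlContext.parquetFile("/path/to/data.parquet")

// Uncached: this scans the parquet file directly and benefits from its
// on-disk columnar layout, so the grouping finishes quickly.
df.groupBy("someKey").count().show()

// Cache the DataFrame in memory and force materialization.
df.cache()
df.count()

// Cached: the same grouping now reads the in-memory columnar cache.
// With complex types in the schema, the serialization/deserialization
// path tracked in SPARK-6759 makes this pass much slower.
df.groupBy("someKey").count().show()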