I think the slowness is caused by the way we serialize/deserialize the
values of complex types. I have opened
https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.
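
Until that is fixed, one possible workaround is to flatten the struct into
top-level primitive columns before caching, so the in-memory columnar store
only has to handle primitive types. A rough sketch (the column names "id",
"features", "a", and "b" are made up for illustration):

    // assuming df has a struct column "features" with fields "a" and "b"
    // (hypothetical names), pull the fields out as top-level columns
    val flat = df.select(
      df("id"),
      df("features.a").as("features_a"),
      df("features.b").as("features_b"))

    // cache the flattened DataFrame instead of the original one
    flat.cache()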

On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip <yipjus...@prediction.io> wrote:

> The schema has a StructType.
>
> Justin
>
> On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Justin,
>>
>> Does the schema of your data have any decimal, array, map, or struct type?
>>
>> Thanks,
>>
>> Yin
>>
>> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip <yipjus...@prediction.io>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a Parquet file of around 55M rows (~1G on disk). Performing a
>>> simple grouping operation is pretty efficient (I get results within 10
>>> seconds). However, after calling DataFrame.cache, I observe a significant
>>> performance degradation: the same operation now takes 3+ minutes.
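>>>
>>> Roughly what I am running (the path and column name are placeholders):
>>>
>>>     val df = sqlContext.parquetFile("/path/to/file.parquet")
>>>     df.groupBy("someColumn").count().show()  // fast: ~10 seconds
>>>
>>>     df.cache()
>>>     df.count()  // materialize the cache
>>>     df.groupBy("someColumn").count().show()  // slow: 3+ minutes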
>>>
>>> My hunch is that the DataFrame cannot leverage its columnar format
>>> after being persisted in memory, but I cannot find anything in the
>>> docs mentioning this.
>>>
>>> Did I miss anything?
>>>
>>> Thanks!
>>>
>>> Justin
>>>
>>
>>
>
