Thanks for the explanation, Yin.

Justin
On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai <yh...@databricks.com> wrote:

> I think the slowness is caused by the way that we serialize/deserialize
> the value of a complex type. I have opened
> https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.
>
> On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip <yipjus...@prediction.io> wrote:
>
>> The schema has a StructType.
>>
>> Justin
>>
>> On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> Hi Justin,
>>>
>>> Does the schema of your data have any decimal, array, map, or struct
>>> type?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip <yipjus...@prediction.io> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a parquet file of around 55M rows (~1G on disk). Performing a
>>>> simple grouping operation is pretty efficient (I get results within 10
>>>> seconds). However, after calling DataFrame.cache, I observe significant
>>>> performance degradation: the same operation now takes 3+ minutes.
>>>>
>>>> My hunch is that the DataFrame cannot leverage its columnar format
>>>> after persisting in memory, but I cannot find this mentioned anywhere
>>>> in the docs.
>>>>
>>>> Did I miss anything?
>>>>
>>>> Thanks!
>>>>
>>>> Justin
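P.S. For anyone hitting this later, here is a minimal sketch of the pattern
discussed in this thread, using the Spark 1.3 DataFrame API. The file path
and the grouping column name are placeholders, not from the actual dataset;
substitute a parquet file whose schema contains a struct (or other complex)
column. It can be pasted into spark-shell (drop the SparkContext setup
lines there, since sc and sqlContext already exist):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("cache-slowdown-repro").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Load a parquet file whose schema includes a complex (struct) column.
// The path is a placeholder.
val df = sqlContext.parquetFile("/path/to/data.parquet")

// Uncached: this scans the parquet file directly and benefits from its
// on-disk columnar layout, so the grouping finishes quickly.
df.groupBy("someKey").count().show()

// Cache the DataFrame in memory and force materialization.
df.cache()
df.count()

// Cached: the same grouping now reads the in-memory columnar cache.
// With complex types in the schema, the serialization/deserialization
// path tracked in SPARK-6759 makes this pass much slower.
df.groupBy("someKey").count().show()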