Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Thanks for the explanation Yin.

Justin

On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai wrote:
> I think the slowness is caused by the way that we serialize/deserialize
> the value of a complex type. I have opened
> https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.
>
> On Tue,

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
I think the slowness is caused by the way that we serialize/deserialize the value of a complex type. I have opened https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.

On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip wrote:
> The schema has a StructType.
>
> Justin
>
> On Tue, Ap

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
The schema has a StructType.

Justin

On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai wrote:
> Hi Justin,
>
> Does the schema of your data have any decimal, array, map, or struct type?
>
> Thanks,
>
> Yin
>
> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip wrote:
>> Hello,
>>
>> I have a parquet file of

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin,

Does the schema of your data have any decimal, array, map, or struct type?

Thanks,

Yin

On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip wrote:
> Hello,
>
> I have a parquet file of around 55M rows (~ 1G on disk). Performing simple
> grouping operation is pretty efficient (I get results w
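For readers who want to answer the schema question for their own data, here is a minimal sketch. It assumes the input looks like PySpark's `DataFrame.dtypes`, that is, a list of (column name, type string) pairs; the helper name `complex_columns` and the example schema are hypothetical and are not taken from the thread.

```python
# Type-name prefixes for the four kinds of types asked about in the thread.
COMPLEX_PREFIXES = ("decimal", "array", "map", "struct")


def complex_columns(dtypes):
    """Return the (name, type) pairs whose type is decimal, array, map, or struct.

    `dtypes` is assumed to be shaped like PySpark's `DataFrame.dtypes`:
    a list of (column_name, type_string) tuples, for example
    ("point", "struct<x:double,y:double>"). Only top-level columns are checked.
    """
    return [
        (name, type_string)
        for name, type_string in dtypes
        if type_string.lower().startswith(COMPLEX_PREFIXES)
    ]


# Hypothetical schema for illustration only (not the poster's data).
example_dtypes = [
    ("id", "bigint"),
    ("price", "decimal(10,2)"),
    ("tags", "array<string>"),
    ("point", "struct<x:double,y:double>"),
]
print(complex_columns(example_dtypes))
```

With a real DataFrame `df`, the call would be `complex_columns(df.dtypes)`. An empty result would suggest that none of the types named in the question appear at the top level of the schema.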

DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Hello,

I have a Parquet file of around 55M rows (~1 GB on disk). Performing a simple grouping operation is pretty efficient (I get results within 10 seconds). However, after calling DataFrame.cache, I observe a significant performance degradation: the same operation now takes 3+ minutes. My hunch is that
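The comparison described in this message can be reproduced roughly as follows. This is a sketch under assumptions, not the poster's actual code: `df` is assumed to be a Spark DataFrame exposing `groupBy`, `count`, `collect`, and `cache`, and the helper names `timed` and `compare_cache` are hypothetical.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_cache(df, key):
    """Time the same grouping on `df` before and after `df.cache()`.

    `df` is assumed to be a Spark DataFrame and `key` one of its column names.
    """
    def group_count(frame):
        return frame.groupBy(key).count().collect()

    _, uncached_seconds = timed(lambda: group_count(df))

    cached_df = df.cache()
    # Force the cache to be built first, so that the cost of building it
    # is not counted in the second measurement.
    cached_df.count()

    _, cached_seconds = timed(lambda: group_count(cached_df))
    return {"uncached_s": uncached_seconds, "cached_s": cached_seconds}
```

Given a DataFrame `df` loaded from Parquet and a column name of your choosing, `compare_cache(df, "some_column")` returns both timings side by side. Materializing the cache before the second run separates the one-time cost of building it from the cost of reading from it.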