I think the slowness is caused by the way we serialize/deserialize the values of complex types in the in-memory columnar cache. I have opened https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.
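For reference, a rough sketch of the kind of code that hits this (Spark 1.3-era API; the path and column names below are placeholders, not from Justin's job):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: the existing SparkContext

    // Schema is assumed to contain a StructType column, e.g. payload: struct<...>
    val df = sqlContext.parquetFile("/path/to/data.parquet")

    // Fast: the grouping runs directly against the Parquet scan.
    df.groupBy("someKey").count().collect()

    // Slow: once cached, the in-memory columnar store must
    // serialize/deserialize every struct value (see SPARK-6759).
    df.cache()
    df.groupBy("someKey").count().collect()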
On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip <yipjus...@prediction.io> wrote:

> The schema has a StructType.
>
> Justin
>
> On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Justin,
>>
>> Does the schema of your data have any decimal, array, map, or struct
>> type?
>>
>> Thanks,
>>
>> Yin
>>
>> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip <yipjus...@prediction.io>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a Parquet file of around 55M rows (~1 GB on disk). Performing a
>>> simple grouping operation is pretty efficient; I get results within 10
>>> seconds. However, after calling DataFrame.cache, I observe a significant
>>> performance degradation: the same operation now takes 3+ minutes.
>>>
>>> My hunch is that the DataFrame cannot leverage its columnar format after
>>> being persisted in memory, but I cannot find this mentioned anywhere in
>>> the docs.
>>>
>>> Did I miss anything?
>>>
>>> Thanks!
>>>
>>> Justin
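P.S. Until SPARK-6759 is resolved, one possible workaround might be to flatten the struct into top-level primitive columns before caching, so the columnar cache only has to handle primitive types. A rough sketch; the nested field names here are made up for illustration:

    // Hypothetical: pull leaf fields out of the struct, then cache the
    // flattened DataFrame instead of the original one.
    val flat = df.select("payload.userId", "payload.score", "someKey")
      .toDF("userId", "score", "someKey")  // rename to stable top-level names

    flat.cache()
    flat.groupBy("someKey").count().collect()

Whether this helps will depend on how much of the struct your query actually needs.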