Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
You could add a new ColumnType. PRs welcome :)

On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel wrote:
> Hi Michael,
>
> As a test, I have same data loaded as another parquet - excep

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Hi Michael,

As a test, I have the same data loaded as another parquet, except with the 2 decimal(14,4) columns replaced by double. With this, the on-disk size is ~345MB, the in-memory size is 2GB (vs. 12GB), and the cached query runs in half the time of the uncached query.

Would it be possible for Spark to sto
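To make the size difference above more concrete, here is a rough sketch in plain Python. It is only an analogy and not Spark's code: `struct` packing stands in for a fixed-width 8-byte double column, and `pickle` stands in for a generic object serializer applied to decimal values. The sample values and all names are made up for illustration.

```python
import pickle
import struct
from decimal import Decimal

n = 1000

# Hypothetical sample values with 4 fractional digits, like decimal(14,4).
as_decimal = [Decimal(f"{1234567890 + i}.{i % 10000:04d}") for i in range(n)]
as_double = [float(d) for d in as_decimal]

# Fixed-width layout: every double takes exactly 8 bytes.
packed_doubles = struct.pack(f"<{n}d", *as_double)

# Generic object serialization (a stand-in, NOT what Spark uses internally).
generic_decimals = pickle.dumps(as_decimal)

print("packed doubles  :", len(packed_doubles), "bytes")
print("pickled decimals:", len(generic_decimals), "bytes")
```

The exact ratio printed here says nothing about the 2GB vs. 12GB figures in the thread; it only shows why a type without a compact fixed-width representation tends to cost more per value. Note also that converting decimals to double trades exactness for size.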

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Could you share which data types are optimized in the in-memory storage and how they are optimized?

On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust wrote:
> You'll probably only get good compression for strings when dictionary
> encoding works. We don't optimize decimals in the in-memory colu

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
You'll probably only get good compression for strings when dictionary encoding works. We don't optimize decimals in the in-memory columnar storage, so you are likely paying for expensive serialization there.

On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel wrote:
> Flat data of types String, Int and cou
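As a toy illustration of why dictionary encoding only pays off for some string columns, here is a minimal sketch in plain Python. It is not Spark's implementation; the function names and sample data are hypothetical. Each distinct string is stored once and the column itself becomes a list of small integer codes, so the benefit depends on how few distinct values there are.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value is stored only once."""
    dictionary = []   # distinct values, in first-seen order
    index = {}        # value -> position in dictionary
    codes = []        # one small integer per row
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


# Low cardinality: a handful of distinct strings repeated many times.
low_cardinality = ["US", "IN", "US", "DE", "IN", "US"] * 1000
# High cardinality: every string is different.
high_cardinality = [f"user-{i}" for i in range(6000)]

low_dict, low_codes = dictionary_encode(low_cardinality)
high_dict, high_codes = dictionary_encode(high_cardinality)

print("low cardinality :", len(low_dict), "dictionary entries for", len(low_codes), "rows")
print("high cardinality:", len(high_dict), "dictionary entries for", len(high_codes), "rows")
```

In the low-cardinality case the dictionary stays tiny, so the column shrinks to roughly one small code per row. In the high-cardinality case the dictionary is as large as the column, and the encoding saves nothing.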

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Flat data of types String, Int and a couple of decimal(14,4)

On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust wrote:
> Is this nested data or flat data?
>
> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel wrote:
>
>> Hi Michael,
>>
>> The storage tab shows the RDD resides fully in memory (10 partit

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
Is this nested data or flat data?

On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel wrote:
> Hi Michael,
>
> The storage tab shows the RDD resides fully in memory (10 partitions) with
> zero disk usage. Tasks for subsequent select on this table in cache shows
> minimal overheads (GC, queueing, shuffle

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Hi Michael,

The storage tab shows the RDD resides fully in memory (10 partitions) with zero disk usage. Tasks for subsequent selects on this cached table show minimal overheads (GC, queueing, shuffle write, etc.), so overhead is not the issue. However, it is still twice as slow as reading uncach

Re: SQL group by on Parquet table slower when table cached

2015-02-06 Thread Michael Armbrust
Check the storage tab. Does the table actually fit in memory? Otherwise you are rebuilding column buffers in addition to reading the data off of the disk.

On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel wrote:
> Spark 1.2
>
> Data stored in parquet table (large number of rows)
>
> Test 1
>
> select

SQL group by on Parquet table slower when table cached

2015-02-06 Thread Manoj Samel
Spark 1.2

Data stored in parquet table (large number of rows)

Test 1

select a, sum(b), sum(c) from table

Test

sqlContext.cacheTable()
select a, sum(b), sum(c) from table - "seed cache" First time slow since loading cache?
select a, sum(b), sum(c) from table - Second time it should be faster