Flat data of types String, Int, and a couple of decimal(14,4).

On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Is this nested data or flat data?
>
> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com>
> wrote:
>
>> Hi Michael,
>>
>> The storage tab shows the RDD resides fully in memory (10 partitions)
>> with zero disk usage. Tasks for subsequent select on this table in cache
>> shows minimal overheads (GC, queueing, shuffle write etc. etc.), so
>> overhead is not issue. However, it is still twice as slow as reading
>> uncached table.
>>
>> I have spark.rdd.compress = true,
>> spark.sql.inMemoryColumnarStorage.compressed = true,
>> spark.serializer = org.apache.spark.serializer.KryoSerializer
>>
>> Something that may be of relevance ...
>>
>> The underlying table is Parquet, 10 partitions totaling ~350 MB. For
>> mapPartition phase of query on uncached table shows input size of 351 MB.
>> However, after the table is cached, the storage shows the cache size as
>> 12GB. So the in-memory representation seems much bigger than on-disk, even
>> with the compression options turned on. Any thoughts on this?
>>
>> mapPartition phase same query for cache table shows input size of 12GB
>> (full size of cache table) and takes twice the time as mapPartition for
>> uncached query.
>>
>> Thanks,
>>
>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Check the storage tab. Does the table actually fit in memory? Otherwise
>>> you are rebuilding column buffers in addition to reading the data off of
>>> the disk.
>>>
>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
>>> wrote:
>>>
>>>> Spark 1.2
>>>>
>>>> Data stored in parquet table (large number of rows)
>>>>
>>>> Test 1
>>>>
>>>> select a, sum(b), sum(c) from table
>>>>
>>>> Test
>>>>
>>>> sqlContext.cacheTable()
>>>> select a, sum(b), sum(c) from table - "seed cache" First time slow
>>>> since loading cache?
>>>> select a, sum(b), sum(c) from table - Second time it should be faster
>>>> as it should be reading from cache, not HDFS.
>>>> But it is slower than test1
>>>>
>>>> Any thoughts? Should a different query be used to seed cache?
>>>>
>>>> Thanks,
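
As a rough check on the sizes reported in the thread (an illustrative calculation, not part of the original messages): the uncached query reads 351 MB of Parquet input, while the storage tab reports the cached table as 12GB. The sketch below assumes "12GB" means 12 * 1024 MB, which the thread does not state, and `expansion_ratio` is a hypothetical helper name used only here.

```python
# Rough size comparison using the figures quoted in the thread.
# Assumption: "12GB" is taken as 12 * 1024 MB; the thread does not say
# which unit convention the Spark UI used.


def expansion_ratio(on_disk_mb: float, in_memory_mb: float) -> float:
    """Return how many times larger the cached data is than the on-disk data."""
    if on_disk_mb <= 0:
        raise ValueError("on_disk_mb must be positive")
    return in_memory_mb / on_disk_mb


PARQUET_INPUT_MB = 351       # input size reported for the uncached query
CACHE_SIZE_MB = 12 * 1024    # cache size reported on the storage tab

ratio = expansion_ratio(PARQUET_INPUT_MB, CACHE_SIZE_MB)
print(f"cached table is roughly {ratio:.0f}x the Parquet input")
```

Under that assumption the cached representation is about 35 times the on-disk size, so each query over the cached table scans far more bytes than a query over the Parquet files. That would be consistent with the slower cached query described above, though it does not by itself explain why the in-memory form is so large.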