Is this nested data or flat data?

On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com>
wrote:

> Hi Michael,
>
> The storage tab shows the RDD resides fully in memory (10 partitions) with
> zero disk usage. Tasks for a subsequent select on the cached table show
> minimal overheads (GC, queueing, shuffle write, etc.), so overhead is not
> the issue. However, it is still twice as slow as reading the uncached table.
>
> I have set:
>
>   spark.rdd.compress = true
>   spark.sql.inMemoryColumnarStorage.compressed = true
>   spark.serializer = org.apache.spark.serializer.KryoSerializer
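>
> In code form, these settings amount to roughly the following (a sketch;
> whether they are set via spark-defaults or programmatically should not
> matter here):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SQLContext
>
>   val conf = new SparkConf()
>     .set("spark.rdd.compress", "true")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   val sc = new SparkContext(conf)
>   val sqlContext = new SQLContext(sc)
>   // SQL-specific setting, applied on the SQLContext
>   sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")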
>
> Something that may be of relevance ...
>
> The underlying table is Parquet, 10 partitions totaling ~350 MB. The
> mapPartition phase of the query on the uncached table shows an input size
> of 351 MB. However, after the table is cached, the storage tab shows the
> cache size as 12 GB, so the in-memory representation seems to be much
> bigger than the on-disk one, even with the compression options turned on.
> Any thoughts on this?
>
> The mapPartition phase of the same query on the cached table shows an input
> size of 12 GB (the full size of the cached table) and takes twice as long
> as the mapPartition phase of the uncached query.
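>
> As a cross-check of the storage tab, the same numbers can be read
> programmatically via sc.getRDDStorageInfo (a developer API). A minimal
> sketch:
>
>   sc.getRDDStorageInfo.foreach { info =>
>     println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
>       s"partitions cached, mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
>   }
>
> which should show the same ~12 GB in memory / zero on disk for the cached
> table.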
>
> Thanks,
>
> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Check the storage tab. Does the table actually fit in memory? Otherwise
>> you are rebuilding the column buffers in addition to reading the data off
>> disk.
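>>
>> For a quick programmatic check (assuming the table was registered under
>> the name "table"):
>>
>>   sqlContext.isCached("table")   // true once cacheTable has been called
>>
>> and the storage tab then shows how many of its partitions actually made it
>> into memory.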
>>
>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
>> wrote:
>>
>>> Spark 1.2
>>>
>>> Data is stored in a Parquet table (large number of rows).
>>>
>>> Test 1
>>>
>>> select a, sum(b), sum(c) from table
>>>
>>> Test 2
>>>
>>> sqlContext.cacheTable("table")
>>> select a, sum(b), sum(c) from table  - "seed cache": the first run is
>>> slow, presumably because it is loading the cache?
>>> select a, sum(b), sum(c) from table  - the second run should be faster,
>>> since it should read from the cache rather than HDFS, but it is slower
>>> than Test 1.
>>>
>>> Any thoughts? Should a different query be used to seed the cache?
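>>>
>>> For reference, the sequence is roughly the following (the Parquet path
>>> and the "group by a" are placeholders / assumptions on my part):
>>>
>>>   val parquetTable = sqlContext.parquetFile("/path/to/table")
>>>   parquetTable.registerTempTable("table")
>>>   sqlContext.cacheTable("table")
>>>   // first run builds the in-memory columnar cache while answering the query
>>>   sqlContext.sql("select a, sum(b), sum(c) from table group by a").collect()
>>>   // second run should read from the cache instead of HDFS
>>>   sqlContext.sql("select a, sum(b), sum(c) from table group by a").collect()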
>>>
>>> Thanks,
>>>
>>>
>>
>
