Re: SQL group by on Parquet table slower when table cached

Manoj Samel Mon, 09 Feb 2015 14:20:48 -0800

Flat data of types String, Int, and a couple of decimal(14,4) columns.

On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com>
wrote:


> Is this nested data or flat data?
>
> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com>
> wrote:
>
>> Hi Michael,
>>
>> The Storage tab shows that the RDD resides fully in memory (10 partitions)
>> with zero disk usage. Tasks for a subsequent select on this cached table
>> show minimal overhead (GC, queueing, shuffle write, etc.), so overhead is
>> not the issue. However, it is still twice as slow as reading the uncached
>> table.
>>
>> I have set:
>>   spark.rdd.compress = true
>>   spark.sql.inMemoryColumnarStorage.compressed = true
>>   spark.serializer = org.apache.spark.serializer.KryoSerializer
>>
>> Something that may be of relevance ...
>>
>> The underlying table is Parquet, 10 partitions totaling ~350 MB. The
>> mapPartition phase of the query on the uncached table shows an input size
>> of 351 MB. However, after the table is cached, the Storage tab shows the
>> cache size as 12 GB. So the in-memory representation seems much bigger
>> than the on-disk one, even with the compression options turned on. Any
>> thoughts on this?
>>
>> The mapPartition phase of the same query on the cached table shows an
>> input size of 12 GB (the full size of the cached table) and takes twice as
>> long as the mapPartition phase of the uncached query.
>>
>> Thanks,
>>
>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Check the storage tab.  Does the table actually fit in memory? Otherwise
>>> you are rebuilding column buffers in addition to reading the data off of
>>> the disk.
>>>
>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
>>> wrote:
>>>
>>>> Spark 1.2
>>>>
>>>> Data stored in parquet table (large number of rows)
>>>>
>>>> Test 1
>>>>
>>>> select a, sum(b), sum(c) from table group by a
>>>>
>>>> Test 2
>>>>
>>>> sqlContext.cacheTable("table")
>>>> select a, sum(b), sum(c) from table group by a  - "seed cache". First
>>>> run is slow since it is loading the cache?
>>>> select a, sum(b), sum(c) from table group by a  - Second run should be
>>>> faster as it should be reading from the cache, not HDFS. But it is
>>>> slower than Test 1.
>>>>
>>>> Any thoughts? Should a different query be used to seed the cache?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>
>>
>
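
For scale, the sizes quoted in the thread can be turned into an expansion
factor. The following is only a back-of-the-envelope sketch in Python: it
takes the 351 MB, 12 GB, and 10-partition figures at face value and assumes
1 GB = 1024 MB; the variable names are illustrative and not from the thread.

```python
# Figures as reported in the thread (assumption: 1 GB = 1024 MB).
parquet_mb = 351   # input size of the mapPartition phase on the uncached table
cached_gb = 12     # cache size shown in the Storage tab after caching
partitions = 10    # number of partitions reported for the table

cached_mb = cached_gb * 1024

# How much larger the in-memory columnar cache is than the Parquet input.
expansion = cached_mb / parquet_mb

# Average amount of cached data each partition's task has to scan.
per_partition_mb = cached_mb / partitions

print(f"in-memory / on-disk expansion: {expansion:.1f}x")
print(f"cached data per partition: {per_partition_mb:.0f} MB")
```

Under those assumptions the cached table is roughly 35 times larger than
the Parquet input, so each task on the cached table scans far more bytes
than its uncached counterpart, which is consistent with the slower
mapPartition phase described above.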
