That sounds right to me. Cheng could elaborate if you are missing something.
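For reference, here is a minimal, self-contained sketch of the idea raised in the quoted thread below: representing a fixed-precision decimal such as decimal(14,4) by its unscaled Long value, which is roughly what a dedicated decimal column type in org.apache.spark.sql.columnar could write into the column buffers instead of serializing a full Decimal object per row. This does not use the actual ColumnType API; the object name, method names and buffer layout are illustrative only.

import java.nio.ByteBuffer
import java.math.{BigDecimal => JBigDecimal}

object FixedDecimalSketch {
  // decimal(14,4): 14 digits of precision, 4 after the decimal point.
  // The unscaled value of such a decimal always fits in a Long.
  val Scale = 4

  // Write the decimal into the buffer as its 8-byte unscaled Long value,
  // instead of serializing a boxed BigDecimal per row.
  def append(value: JBigDecimal, buffer: ByteBuffer): Unit =
    buffer.putLong(value.setScale(Scale).unscaledValue().longValue())

  // Read the 8 bytes back and re-attach the fixed scale.
  def extract(buffer: ByteBuffer): JBigDecimal =
    JBigDecimal.valueOf(buffer.getLong(), Scale)

  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(8)
    append(new JBigDecimal("1234.5678"), buf)
    buf.flip()
    println(extract(buf))  // 1234.5678, stored in 8 bytes
  }
}

An actual change would presumably wire this encoding into the columnar classes listed in the thread (ColumnAccessor.scala, ColumnBuilder.scala, ColumnStats.scala, ColumnType.scala) rather than live as a standalone object.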
On Fri, Feb 13, 2015 at 11:36 AM, Manoj Samel <manojsamelt...@gmail.com> wrote:

> Thanks Michael for the pointer, and sorry for the delayed reply.
>
> Taking a quick inventory of the scope of the change: is the column type for
> Decimal caching needed only in the caching layer (4 files in
> org.apache.spark.sql.columnar - ColumnAccessor.scala, ColumnBuilder.scala,
> ColumnStats.scala, ColumnType.scala), or do other SQL components also need
> to be touched?
>
> Hoping for quick feedback off the top of your head ...
>
> Thanks,
>
> On Mon, Feb 9, 2015 at 3:16 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> You could add a new ColumnType
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala>.
>>
>> PRs welcome :)
>>
>> On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> As a test, I have the same data loaded as another Parquet table, except
>>> with the two decimal(14,4) columns replaced by double. With this, the
>>> on-disk size is ~345 MB, the in-memory size is 2 GB (vs. 12 GB), and the
>>> cached query runs in half the time of the uncached query.
>>>
>>> Would it be possible for Spark to store the in-memory decimal in some
>>> form of long with decoration?
>>>
>>> For the immediate future, is there any hook we can use to provide custom
>>> caching / processing for the decimal type in the RDD so that other
>>> semantics do not change?
>>>
>>> Thanks,
>>>
>>> On Mon, Feb 9, 2015 at 2:41 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>
>>>> Could you share which data types are optimized in the in-memory storage
>>>> and how they are optimized?
>>>>
>>>> On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>>> You'll probably only get good compression for strings when dictionary
>>>>> encoding works. We don't optimize decimals in the in-memory columnar
>>>>> storage, so you are likely paying for expensive serialization there.
>>>>>
>>>>> On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>>
>>>>>> Flat data of types String, Int and a couple of decimal(14,4).
>>>>>>
>>>>>> On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>
>>>>>>> Is this nested data or flat data?
>>>>>>>
>>>>>>> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> The storage tab shows the RDD resides fully in memory (10 partitions)
>>>>>>>> with zero disk usage. Tasks for a subsequent select on this cached
>>>>>>>> table show minimal overheads (GC, queueing, shuffle write, etc.), so
>>>>>>>> overhead is not the issue. However, it is still twice as slow as
>>>>>>>> reading the uncached table.
>>>>>>>>
>>>>>>>> I have spark.rdd.compress = true,
>>>>>>>> spark.sql.inMemoryColumnarStorage.compressed = true,
>>>>>>>> spark.serializer = org.apache.spark.serializer.KryoSerializer
>>>>>>>>
>>>>>>>> Something that may be of relevance ...
>>>>>>>>
>>>>>>>> The underlying table is Parquet, 10 partitions totaling ~350 MB. The
>>>>>>>> mapPartition phase of the query on the uncached table shows an input
>>>>>>>> size of 351 MB. However, after the table is cached, the storage tab
>>>>>>>> shows the cache size as 12 GB. So the in-memory representation seems
>>>>>>>> much bigger than on-disk, even with the compression options turned on.
>>>>>>>> Any thoughts on this?
>>>>>>>>
>>>>>>>> The mapPartition phase of the same query on the cached table shows an
>>>>>>>> input size of 12 GB (the full size of the cached table) and takes
>>>>>>>> twice the time of the mapPartition for the uncached query.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Check the storage tab. Does the table actually fit in memory?
>>>>>>>>> Otherwise you are rebuilding column buffers in addition to reading
>>>>>>>>> the data off the disk.
>>>>>>>>>
>>>>>>>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Spark 1.2
>>>>>>>>>>
>>>>>>>>>> Data stored in a Parquet table (large number of rows)
>>>>>>>>>>
>>>>>>>>>> Test 1
>>>>>>>>>>
>>>>>>>>>> select a, sum(b), sum(c) from table
>>>>>>>>>>
>>>>>>>>>> Test 2
>>>>>>>>>>
>>>>>>>>>> sqlContext.cacheTable()
>>>>>>>>>> select a, sum(b), sum(c) from table - "seed cache". First time is
>>>>>>>>>> slow since it is loading the cache?
>>>>>>>>>> select a, sum(b), sum(c) from table - Second time it should be
>>>>>>>>>> faster as it should be reading from the cache, not HDFS. But it is
>>>>>>>>>> slower than Test 1.
>>>>>>>>>>
>>>>>>>>>> Any thoughts? Should a different query be used to seed the cache?
>>>>>>>>>>
>>>>>>>>>> Thanks,
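For completeness, a rough sketch of the Test 1 / Test 2 comparison from the bottom of the thread, using the Spark 1.2-era SQLContext API. The Parquet path, the table name "t", and the GROUP BY a are assumptions added to make the queries runnable; the thread only shows the bare aggregates.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheTableTiming {
  // Crude wall-clock timing of a block, printed in seconds.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cache-table-timing")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // In-memory columnar compression setting mentioned in the thread.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // Placeholder path and table name.
    sqlContext.parquetFile("/path/to/parquet").registerTempTable("t")
    val query = "SELECT a, SUM(b), SUM(c) FROM t GROUP BY a"

    // Test 1: read straight from Parquet on HDFS.
    time("uncached")(sqlContext.sql(query).collect())

    // Test 2: cache the table, seed the cache with one run, then re-run.
    sqlContext.cacheTable("t")
    time("first cached run (builds the in-memory column buffers)")(sqlContext.sql(query).collect())
    time("second cached run (reads from the cache, not HDFS)")(sqlContext.sql(query).collect())
  }
}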