[ 
https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470686#comment-13470686
 ] 

Shreepadma Venugopalan commented on HIVE-1362:
----------------------------------------------

I assume that by row-level statistics you mean table statistics. Today, table 
statistics are stored in table_params. The table_params table is mapped to the 
TTable object in memory, and it looks like the existing APIs sufficed for that 
case. We want a dedicated Thrift API for column stats for the following 
reasons:

1. Column statistics are a property of the column, not the table, and hence 
don't belong in table_params. Furthermore, we have seen customers with tables 
that are hundreds to thousands of columns wide. Storing this information in 
table_params would bloat it and would also make the output of DESCRIBE 
EXTENDED unreadable.

2. We want column statistics to be first-class metadata. To do so, we have to 
provide dedicated Thrift APIs to query and update them. We want the Thrift API 
to be self-documenting, i.e. if someone tells you that the metastore supports 
column stats, you should be able to look at the Thrift IDL and figure out which 
method to use to store or retrieve column stats. Right now much of the API 
doesn't satisfy that goal, since many methods are overloaded and other features 
are implemented by adding new key/value properties to various catalog objects, 
which are hard to document via the Thrift API.

3. Additionally, storing column statistics as key/value pairs in the 
table_params table is not space efficient: the keys must be repeated for each 
column in the table for which statistics are gathered. Furthermore, storing 
column stats in the table_params table would completely denormalize the schema 
and incur a performance penalty from the self-joins, though not necessarily in 
the metastore db, needed to retrieve the statistics associated with a column.
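
To make the space and lookup argument concrete, here is a minimal Python sketch, 
not metastore code. The statistic names (num_nulls, num_distinct, max_len), the 
"column_stats.<column>.<stat>" key scheme, and the ColumnStats record are all 
hypothetical and chosen only for illustration; the actual patch may use 
different names and layouts.

```python
from dataclasses import dataclass


def as_table_params(column_stats):
    """Flatten per-column stats into one key/value map (table_params style)."""
    params = {}
    for column, stats in column_stats.items():
        for name, value in stats.items():
            # Hypothetical key scheme: every key repeats the column name
            # and the statistic name, and every value becomes a string.
            params[f"column_stats.{column}.{name}"] = str(value)
    return params


def lookup_kv(params, column):
    """Recover one column's stats by scanning all keys for a prefix."""
    prefix = f"column_stats.{column}."
    return {
        key[len(prefix):]: int(value)
        for key, value in params.items()
        if key.startswith(prefix)
    }


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical dedicated record: one typed entry per column."""
    column: str
    num_nulls: int
    num_distinct: int
    max_len: int


def as_records(column_stats):
    """Build one dedicated record per column instead of many key/value pairs."""
    return [ColumnStats(column=c, **stats) for c, stats in column_stats.items()]


# A table that is 1000 columns wide, with three statistics per column.
wide_table = {
    f"col_{i}": {"num_nulls": i, "num_distinct": 10 * i, "max_len": 32}
    for i in range(1000)
}

params = as_table_params(wide_table)   # one entry per (column, statistic)
records = as_records(wide_table)       # one entry per column
```

With three statistics per column, the key/value form holds three entries for 
every column, each repeating the column and statistic names, and reading one 
column's stats means scanning and parsing keys. The dedicated record keeps one 
typed entry per column.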
                
> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362.2.patch.txt, 
> HIVE-1362.3.patch.txt, HIVE-1362.4.patch.txt, 
> HIVE-1362-gen_thrift.1.patch.txt, HIVE-1362-gen_thrift.2.patch.txt, 
> HIVE-1362-gen_thrift.3.patch.txt, HIVE-1362-gen_thrift.4.patch.txt
>
>


