[ 
https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446432#comment-13446432
 ] 

Shreepadma Venugopalan commented on HIVE-1362:
----------------------------------------------

This patch implements version 1 of the column statistics project in Hive. It 
adds support for computing and persisting statistical summaries of column 
values in Hive tables and partitions. To support column statistics in Hive, 
this patch does the following:

* Adds a new compute stats UDAF to compute scalar statistics for all primitive 
Hive data types. In version 1 of the project, we support the following scalar 
statistics on primitive types: an estimate of the number of distinct values, 
the number of null values, the number of trues/falses for boolean typed 
columns, the max and avg length for string and binary typed columns, and the 
max and min value for long and double typed columns. Note that version 1 of 
the column stats project includes support for column statistics at both the 
table and partition level.

* Adds Metastore schema tables to persist the newly added statistics both at 
table and partition level.
* Adds Metastore Thrift API to persist, retrieve and delete column statistics 
at both table and partition level. 
Please refer to the following wiki link for the details of the schema and the 
Thrift API changes - 
https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

* Extends the analyze table compute statistics statement to trigger statistics 
computation and persistence for one or more columns. Please note that 
statistics for multiple columns are computed through a single scan of the 
table data. Please refer to the following wiki link for the syntax changes - 
https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
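
To illustrate the idea of computing the scalar statistics listed above for 
several columns in a single scan, here is a minimal Python sketch. It is not 
the code in the patch: the names ColumnStats and compute_column_stats are 
hypothetical, rows are modeled as plain dicts, and an exact set stands in for 
the distinct-value estimator (the patch computes an estimate, not an exact 
count).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class ColumnStats:
    """Hypothetical per-column accumulator; not a class from the patch."""
    num_nulls: int = 0
    num_trues: int = 0
    num_falses: int = 0
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    max_len: int = 0
    total_len: int = 0
    num_lengths: int = 0
    # Assumption: an exact set stands in for a distinct-value estimator.
    distinct: set = field(default_factory=set)

    def update(self, value: Any) -> None:
        if value is None:
            self.num_nulls += 1
            return
        # Check bool before numbers: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            if value:
                self.num_trues += 1
            else:
                self.num_falses += 1
            return
        if isinstance(value, (str, bytes)):
            # String and binary columns: track max and total length.
            n = len(value)
            self.max_len = max(self.max_len, n)
            self.total_len += n
            self.num_lengths += 1
        else:
            # Long and double columns: track min and max value.
            if self.min_value is None or value < self.min_value:
                self.min_value = value
            if self.max_value is None or value > self.max_value:
                self.max_value = value
        self.distinct.add(value)

    @property
    def avg_len(self) -> float:
        return self.total_len / self.num_lengths if self.num_lengths else 0.0

    @property
    def num_distinct(self) -> int:
        return len(self.distinct)


def compute_column_stats(rows: Iterable[Dict[str, Any]],
                         columns: List[str]) -> Dict[str, ColumnStats]:
    """Update the accumulators of all requested columns in one pass over rows."""
    stats = {c: ColumnStats() for c in columns}
    for row in rows:  # the data is read exactly once
        for c in columns:
            stats[c].update(row.get(c))
    return stats


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "ab", "active": True},
        {"id": 5, "name": None, "active": False},
    ]
    for col, s in compute_column_stats(sample, ["id", "name", "active"]).items():
        print(col, s.num_nulls, s.num_distinct, s.min_value, s.max_value,
              s.max_len, s.avg_len, s.num_trues, s.num_falses)
```

Keeping one accumulator per column is what lets a single pass serve any number 
of columns: each row is read once and every requested column's accumulator is 
updated from it.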

One thing missing from the patch at this point is the metastore upgrade scripts 
for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to finalize the 
metastore schema changes before I go ahead and add the upgrade scripts.

In a follow-up patch, as part of version 2 of the column statistics project, we 
will add support for computing, persisting, and retrieving histograms on long 
and double typed column values.
                
> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362-gen_thrift.1.patch.txt
>
>


