[jira] [Commented] (HIVE-1362) column level statistics

shrikanth shankar (JIRA) Wed, 03 Oct 2012 21:47:18 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469143#comment-13469143
 ]


shrikanth shankar commented on HIVE-1362:
-----------------------------------------

I had a couple of high-level comments on the patch that seem to fit better here 
than on the review board. Apologies if this violates protocol.
(1) The count_stats aggregation operator repeats many aggregates that Hive 
already supports (count of nulls, count of trues, max, min, etc.). It might make 
more sense to add just one new aggregate that returns the approximate number of 
distinct values (numDV) for a column. Is there any reason why stats collection 
can't simply generate more expressions in the SQL?
(2) There might even be value in adding a different UDAF that returns a 
serialized numDV estimator. Storing this (instead of the count) could be useful 
in other places, e.g. combining numDV estimates across partitions. (A second 
UDAF would be needed to support aggregating these, but that seems easy.)
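
To make (2) a bit more concrete, here is a rough Python sketch of a numDV 
estimator that can be serialized per partition and merged later. It uses a 
K-minimum-values sketch purely for illustration; the patch may well use a 
different estimator. The names KMVSketch, add, merge, serialize and 
deserialize are hypothetical and are not Hive APIs. In Hive the first UDAF 
would play the role of add + serialize, and the second the role of 
deserialize + merge.

```python
import hashlib
import json

HASH_SPACE = 2 ** 64  # hashes are 64-bit unsigned integers


class KMVSketch:
    """K-minimum-values distinct-count estimator (illustrative, not Hive code)."""

    def __init__(self, k=256, hashes=None):
        self.k = k
        # Holds at most the k smallest hash values seen so far.
        self.hashes = set(hashes or [])

    @staticmethod
    def _hash(value):
        # Assumption: repr() is a good enough canonical form for this sketch.
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        if value is None:  # NULLs do not count as distinct values
            return
        self.hashes.add(self._hash(value))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))

    def merge(self, other):
        # The k smallest hashes of the union are exactly what a single
        # sketch would hold after seeing both inputs. Assumes equal k.
        merged = sorted(self.hashes | other.hashes)[: self.k]
        return KMVSketch(self.k, merged)

    def estimate(self):
        if len(self.hashes) < self.k:
            return len(self.hashes)  # fewer than k distinct hashes: exact
        return (self.k - 1) * HASH_SPACE // max(self.hashes)

    def serialize(self):
        return json.dumps({"k": self.k, "hashes": sorted(self.hashes)})

    @classmethod
    def deserialize(cls, text):
        data = json.loads(text)
        return cls(data["k"], data["hashes"])


def sketch_partition(values, k=256):
    """Build and serialize a sketch for one partition's column values."""
    sketch = KMVSketch(k)
    for v in values:
        sketch.add(v)
    return sketch.serialize()
```

Usage would be along these lines: store sketch_partition(...) output per 
partition, then deserialize and merge the stored strings to get a 
table-level numDV without rescanning the data. Simply summing per-partition 
counts would overcount values that appear in more than one partition.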
                
> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362.2.patch.txt, 
> HIVE-1362.3.patch.txt, HIVE-1362-gen_thrift.1.patch.txt, 
> HIVE-1362-gen_thrift.2.patch.txt, HIVE-1362-gen_thrift.3.patch.txt
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
