[ https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469143#comment-13469143 ]
shrikanth shankar commented on HIVE-1362:
-----------------------------------------

I had a couple of high-level comments on the patch that seem to fit better here than on the review board. Apologies if this violates protocol.

(1) The count_stats aggregation operator 'repeats' many existing aggregates that Hive already supports (count of nulls, count of trues, max, min, etc.). It might make a lot more sense to just add an aggregate that returns the approximate number of distinct values for a column. Is there any reason why stats collection can't just generate more expressions in the SQL?

(2) There might even be value in adding a different UDAF that just returns a serialized numDV estimator. Storing this (instead of the count) could be useful in other places, e.g. combining numDV estimates across partitions. (A second UDAF would be needed to support aggregating these, but that seems easy.)

> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362.2.patch.txt,
>                      HIVE-1362.3.patch.txt, HIVE-1362-gen_thrift.1.patch.txt,
>                      HIVE-1362-gen_thrift.2.patch.txt, HIVE-1362-gen_thrift.3.patch.txt
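
To illustrate point (2) of the comment, here is a minimal Python sketch of why storing a serialized numDV estimator can be more useful than storing the count. It uses a K-minimum-values sketch purely as a stand-in; that choice is an assumption for illustration and is not necessarily the estimator used in the patch. The class name KMVSketch and its methods are hypothetical and are not Hive APIs. The first "UDAF" corresponds to building and serializing one sketch per partition, the second to merging the serialized sketches.

```python
import hashlib
import json


class KMVSketch:
    """Hypothetical K-minimum-values estimator for the number of distinct values."""

    def __init__(self, k=256, mins=None):
        self.k = k
        self.mins = set(mins or [])  # the k smallest 64-bit hashes seen so far

    @staticmethod
    def _hash(value):
        # Deterministic 64-bit hash of the value's repr().
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        self.mins.add(self._hash(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        # Both sketches must use the same k. The k smallest hashes of the
        # union are always contained in the two per-partition sketches.
        assert self.k == other.k
        return KMVSketch(self.k, sorted(self.mins | other.mins)[: self.k])

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # fewer than k distinct hashes: exact count
        kth_smallest = max(self.mins)
        return int((self.k - 1) * 2**64 / (kth_smallest + 1))

    def serialize(self):
        return json.dumps({"k": self.k, "mins": sorted(self.mins)})

    @classmethod
    def deserialize(cls, text):
        data = json.loads(text)
        return cls(data["k"], data["mins"])


def sketch_partition(values, k=256):
    """Plays the role of the first UDAF: one serialized sketch per partition."""
    sketch = KMVSketch(k)
    for value in values:
        sketch.add(value)
    return sketch.serialize()


def merge_serialized(serialized_sketches):
    """Plays the role of the second UDAF: combine stored sketches."""
    sketches = [KMVSketch.deserialize(s) for s in serialized_sketches]
    merged = sketches[0]
    for sketch in sketches[1:]:
        merged = merged.merge(sketch)
    return merged


# Two partitions that share the values 50..99.
part_a = sketch_partition(range(0, 100))
part_b = sketch_partition(range(50, 150))
table_level = merge_serialized([part_a, part_b])
print(table_level.estimate())
```

With only per-partition counts stored (100 and 100), there is no way to tell how much the partitions overlap, so the table-level number of distinct values cannot be recovered. Merging the stored sketches accounts for the overlap without rescanning the data.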