[jira] [Commented] (HIVE-4676) Optimize COUNT(*) aggregate over vectorized ORC execution path

Remus Rusanu (JIRA) Fri, 07 Jun 2013 14:09:44 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678442#comment-13678442
 ]


Remus Rusanu commented on HIVE-4676:
------------------------------------

The count(*) aggregate is already optimized to just increment the count with 
the batch size. We just need to make sure the iterator doe snot read extra 
columns.
                
> Optimize COUNT(*) aggregate over vectorized ORC execution path
> --------------------------------------------------------------
>
>                 Key: HIVE-4676
>                 URL: https://issues.apache.org/jira/browse/HIVE-4676
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>    Affects Versions: vectorization-branch
>            Reporter: Eric Hanson
>
> The COUNT\(*\) aggregate with the vectorized execution path over ORC should 
> be optimized because it is a very common case.
> Given a table factsqlengineam_vec_orc with about 25 columns and 218 million 
> rows, this query
> select count\(*\) from factsqlengineam_vec_orc;
> runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555
> and this query
> select count(mrowflag) from factsqlengineam_vec_orc;
> runs in 42 seconds, with HDFS Read 1,207,855
> Because the column mrowflag is non-null, both queries return the same result.
> We should optimize count\(\*\) so that it, say, chooses the most-compressed 
> column from the ORC file (or even a single random column) and counts those 
> values, but logically counts null values too so the meaning is the same as 
> count\(*\). The vectorized iterator should not have to load all columns, just 
> one column minimum, and any columns being filtered in the WHERE clause.
> For scalar count\(*\) aggregates (i.e. without group-by) we can simply tally 
> up the total number of remaining rows in each batch, without even looking at 
> the data. Maybe we're already doing that but we are reading more data than 
> necessary now.
> The query 
> select count\(\*\) from factsqlengineam_vec_orc where mrowflag > 0;
> Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when 
> there is no WHERE clause, more data is read.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-4676) Optimize COUNT(*) aggregate over vectorized ORC execution path

Reply via email to