[ https://issues.apache.org/jira/browse/HIVE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678442#comment-13678442 ]
Remus Rusanu commented on HIVE-4676: ------------------------------------ The count(*) aggregate is already optimized to just increment the count with the batch size. We just need to make sure the iterator doe snot read extra columns. > Optimize COUNT(*) aggregate over vectorized ORC execution path > -------------------------------------------------------------- > > Key: HIVE-4676 > URL: https://issues.apache.org/jira/browse/HIVE-4676 > Project: Hive > Issue Type: Sub-task > Components: Query Processor > Affects Versions: vectorization-branch > Reporter: Eric Hanson > > The COUNT\(*\) aggregate with the vectorized execution path over ORC should > be optimized because it is a very common case. > Given a table factsqlengineam_vec_orc with about 25 columns and 218 million > rows, this query > select count\(*\) from factsqlengineam_vec_orc; > runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555 > and this query > select count(mrowflag) from factsqlengineam_vec_orc; > runs in 42 seconds, with HDFS Read 1,207,855 > Because the column mrowflag is non-null, both queries return the same result. > We should optimize count\(\*\) so that it, say, chooses the most-compressed > column from the ORC file (or even a single random column) and counts those > values, but logically counts null values too so the meaning is the same as > count\(*\). The vectorized iterator should not have to load all columns, just > one column minimum, and any columns being filtered in the WHERE clause. > For scalar count\(*\) aggregates (i.e. without group-by) we can simply tally > up the total number of remaining rows in each batch, without even looking at > the data. Maybe we're already doing that but we are reading more data than > necessary now. > The query > select count\(\*\) from factsqlengineam_vec_orc where mrowflag > 0; > Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when > there is no WHERE clause, more data is read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira