[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

Yin Huai (JIRA) Mon, 15 Jul 2013 23:23:43 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709510#comment-13709510
 ]


Yin Huai commented on HIVE-4113:
--------------------------------

[~brocknoland] It seems that ColumnPrunerTableScanProc (in 
ColumnPrunerProcFactory) calls setNeededColumnIDs on the TableScanOperator to 
set the needed columns, so neededColumnIDs in TableScanOperator will never be 
null. If I am right, then in the HiveInputFormat code shown below ...
{code:java}
// push down projections
ArrayList<Integer> list = tableScan.getNeededColumnIDs();
if (list != null) {
  ColumnProjectionUtils.appendReadColumnIDs(jobConf, list);
} else {
  ColumnProjectionUtils.setReadAllColumns(jobConf);
}
{code}
setReadAllColumns will never be called.
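
To make the concern concrete, here is a minimal, self-contained sketch (not Hive 
code). The class ProjectionSketch, the enum Mode, and the method decide are 
hypothetical names used only for illustration. It assumes, as I suspect above, 
that the column pruner always hands over a non-null list, so a null check alone 
cannot tell "read nothing" (count(1)) apart from "read everything" (select *).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the branch in HiveInputFormat; not the real Hive classes.
class ProjectionSketch {
    enum Mode { READ_ALL, READ_NONE, READ_SOME }

    // Mirrors the quoted null check, and additionally treats an empty list as its own case.
    static Mode decide(List<Integer> neededColumnIDs) {
        if (neededColumnIDs == null) {
            return Mode.READ_ALL;   // would correspond to setReadAllColumns
        }
        if (neededColumnIDs.isEmpty()) {
            return Mode.READ_NONE;  // e.g. select count(1): no column is needed
        }
        return Mode.READ_SOME;      // would correspond to appendReadColumnIDs
    }

    public static void main(String[] args) {
        List<Integer> none = new ArrayList<Integer>();
        List<Integer> some = new ArrayList<Integer>();
        some.add(0);
        System.out.println(decide(null));
        System.out.println(decide(none));
        System.out.println(decide(some));
    }
}
```

If the list is never null, only the last two cases can occur, which is why the 
else branch in the quoted code looks unreachable to me.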

Also, assuming we use RCFile, a 'select count(1)' query will skip all 
columns. It seems that we still generate correct results because the key 
buffer tells us recordsNumInValBuffer (the number of rows in a row group), 
and we call 'next' recordsNumInValBuffer times. Is my understanding 
correct? If so, do you think we should add a comment explaining this where we 
set all elements of skippedColIDs to true? I also think we can take advantage 
of recordsNumInValBuffer to make an improvement: instead of calling 'next' 
recordsNumInValBuffer times, we could pass this number directly to 
GroupByOperator (I have not considered whether this is easy to implement). 
That would avoid a lot of unnecessary function calls. If we want to make this 
improvement, we can work on it in a separate jira. 
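
A rough sketch of the idea, under stated assumptions: this is a toy model, not 
the RCFile reader or GroupByOperator API. The class RowGroupCountSketch and both 
methods are hypothetical, and each list element stands in for one row group's 
recordsNumInValBuffer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of counting rows from per-row-group metadata; not Hive code.
class RowGroupCountSketch {
    // Current behavior as I understand it: one 'next' call (and one downstream call) per row.
    static long countRowByRow(List<Integer> rowsPerGroup) {
        long count = 0;
        for (int recordsNumInValBuffer : rowsPerGroup) {
            for (int i = 0; i < recordsNumInValBuffer; i++) {
                count++; // stands in for next() plus the per-row operator calls
            }
        }
        return count;
    }

    // Proposed behavior: hand each row group's row count to the aggregation in one step.
    static long countPerRowGroup(List<Integer> rowsPerGroup) {
        long count = 0;
        for (int recordsNumInValBuffer : rowsPerGroup) {
            count += recordsNumInValBuffer;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> rowsPerGroup = Arrays.asList(4000, 4000, 1500);
        // Both strategies must agree on the final count.
        System.out.println(countRowByRow(rowsPerGroup) == countPerRowGroup(rowsPerGroup));
    }
}
```

Both methods return the same count; the second does work proportional to the 
number of row groups rather than the number of rows.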
                
> Optimize select count(1) with RCFile and Orc
> --------------------------------------------
>
>                 Key: HIVE-4113
>                 URL: https://issues.apache.org/jira/browse/HIVE-4113
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Gopal V
>            Assignee: Brock Noland
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4113-0.patch, HIVE-4113.patch, HIVE-4113.patch
>
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 
> HDFS Write: 8 SUCCESS
> {code}
> Whereas "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far 
> less
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 
> HDFS Write: 8 SUCCESS
> {code}
> Which is about 12% of the data size read by the COUNT(1).
> This was tracked down to the following code in RCFile.java
> {code}
>       } else {
>         // TODO: if no column name is specified e.g, in select count(1) from tt;
>         // skip all columns, this should be distinguished from the case:
>         // select * from tt;
>         for (int i = 0; i < skippedColIDs.length; i++) {
>           skippedColIDs[i] = false;
>         }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
