[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709510#comment-13709510 ]
Yin Huai commented on HIVE-4113:
--------------------------------

[~brocknoland] It seems that ColumnPrunerTableScanProc (in ColumnPrunerProcFactory) calls setNeededColumnIDs on TableScanOperator to set the needed columns, so neededColumnIDs in TableScanOperator will never be null. If I am right, then in the HiveInputFormat code shown below ...
{code:java}
// push down projections
ArrayList<Integer> list = tableScan.getNeededColumnIDs();
if (list != null) {
  ColumnProjectionUtils.appendReadColumnIDs(jobConf, list);
} else {
  ColumnProjectionUtils.setReadAllColumns(jobConf);
}
{code}
setReadAllColumns will never be called.

Also, assuming we use RCFile, a 'select count(1)' query skips all columns. It seems that we can still generate correct results because the key buffer tells us recordsNumInValBuffer (the number of rows in a row group), and we call 'next' recordsNumInValBuffer times. Is my understanding correct? If so, do you think we should add some comments explaining this where we set all elements of skippedColIDs to true?

I think we can take advantage of recordsNumInValBuffer to make an improvement. Instead of calling 'next' recordsNumInValBuffer times, we can pass this number directly to GroupByOperator (I have not considered whether this is easy to implement). That would avoid a lot of unnecessary function calls. If we want to make this improvement, we can work on it in a separate jira.

> Optimize select count(1) with RCFile and Orc
> --------------------------------------------
>
>                 Key: HIVE-4113
>                 URL: https://issues.apache.org/jira/browse/HIVE-4113
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Gopal V
>            Assignee: Brock Noland
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4113-0.patch, HIVE-4113.patch, HIVE-4113.patch
>
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5  Reduce: 1  Cumulative CPU: 31.73 sec  HDFS Read: 234914410 HDFS Write: 8 SUCCESS
> {code}
> Whereas, "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far less
> {code}
> Job 0: Map: 5  Reduce: 1  Cumulative CPU: 29.75 sec  HDFS Read: 28145994 HDFS Write: 8 SUCCESS
> {code}
> Which is 11% of the data size read by the COUNT(1).
> This was tracked down to the following code in RCFile.java
> {code}
>       } else {
>         // TODO: if no column name is specified e.g, in select count(1) from tt;
>         // skip all columns, this should be distinguished from the case:
>         // select * from tt;
>         for (int i = 0; i < skippedColIDs.length; i++) {
>           skippedColIDs[i] = false;
>         }
> {code}
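
The row-counting idea from the comment above can be illustrated with a small, self-contained sketch. This is not Hive code: RowGroup, countByIteration and countByRowGroup are hypothetical names invented for illustration, and only the field name recordsNumInValBuffer is taken from the comment. The sketch assumes every column is skipped, so the per-row-group row count is the only information needed; how such a count would actually be handed to GroupByOperator is left open, as in the comment.

```java
import java.util.Arrays;
import java.util.List;

public class RowGroupCountSketch {

    /** Hypothetical stand-in for a row group whose columns are all skipped. */
    public static final class RowGroup {
        /** Number of rows in this row group, as known from the key buffer. */
        public final int recordsNumInValBuffer;
        private int consumed = 0;

        public RowGroup(int recordsNumInValBuffer) {
            this.recordsNumInValBuffer = recordsNumInValBuffer;
        }

        /** Advances by one row without decoding any column; false when exhausted. */
        public boolean next() {
            if (consumed < recordsNumInValBuffer) {
                consumed++;
                return true;
            }
            return false;
        }
    }

    /** Behaviour described in the comment: one next() call per row. */
    public static long countByIteration(List<RowGroup> groups) {
        long count = 0;
        for (RowGroup group : groups) {
            while (group.next()) {
                count++;
            }
        }
        return count;
    }

    /** Suggested shortcut: add each row group's row count in a single step. */
    public static long countByRowGroup(List<RowGroup> groups) {
        long count = 0;
        for (RowGroup group : groups) {
            count += group.recordsNumInValBuffer;
        }
        return count;
    }

    public static void main(String[] args) {
        List<RowGroup> groups =
            Arrays.asList(new RowGroup(3), new RowGroup(0), new RowGroup(3));
        // Compute the shortcut first, because countByIteration consumes the groups.
        System.out.println("per row group: " + countByRowGroup(groups));
        System.out.println("per row:       " + countByIteration(groups));
    }
}
```

Both methods return the same total; the second makes one addition per row group instead of one call per row, which is where the saving in function calls would come from.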