[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482075#comment-13482075 ]
Lianhui Wang commented on HIVE-3420:
------------------------------------

@Gang Deng Yes, I agree with you. In the InputFormat's getRecordReader(), the call

    tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
        getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
            jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));

has already done

    tableSplit = new TableSplit(
        tableSplit.getTableName(),
        startRow,
        stopRow,
        tableSplit.getRegionLocation(),
        tableSplit.getConf());

Also, in getSplits(), each TableSplit leads to one task on its region location. Right now those splits have no effect, so the startRow/stopRow set on a TableSplit should be kept inside that split's region row range.

IMO, the convertFilter() logic is duplicated in several places, for example:

    HBaseStorageHandler.decomposePredicate()
    HiveHBaseTableInputFormat.getSplits()
    HiveHBaseTableInputFormat.getRecordReader()

I think it should live in one place: HBaseStorageHandler.decomposePredicate(), which can store the row key ranges. Then HiveHBaseTableInputFormat.getSplits() and HiveHBaseTableInputFormat.getRecordReader() can split the key ranges into tasks according to the table's region info (see the sketch after the quoted description below). Does anyone else have ideas? Thanks.

> Inefficiency in HBase handler when processing queries including rowkey range scans
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Priority: Critical
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, the Hive map tasks do not leverage the startRow/endRow information in the TableSplit. For example, if the row keys fit into 5 HBase files, there will be 5 map tasks. Ideally, each task would process 1 file, but in the current implementation each task processes all 5 files repeatedly. This not only wastes network bandwidth, but also worsens the lock contention in the HBase block cache, since every task has to access the same blocks. The problem code is in HiveHBaseTableInputFormat.convertFilter, as below:
> ……
> if (tableSplit != null) {
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       startRow,
>       stopRow,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(startRow);
> scan.setStopRow(stopRow);
> ……
> As the TableSplit already includes the startRow/endRow information of the file, a better implementation would be:
> ……
> byte[] splitStart = startRow;
> byte[] splitStop = stopRow;
> if (tableSplit != null) {
>   if (tableSplit.getStartRow() != null) {
>     splitStart = startRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>   }
>   if (tableSplit.getEndRow() != null) {
>     splitStop = (stopRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>         tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>   }
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       splitStart,
>       splitStop,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(splitStart);
> scan.setStopRow(splitStop);
> ……
> In my test, the changed code improves performance by more than 30%.
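For reference, a minimal sketch of the consolidated range-intersection helper suggested in the comment above. The RowRange class and its intersect(), laterOf(), and earlierOf() methods are hypothetical names invented for illustration, not existing Hive or HBase code; only Bytes.compareTo() and the HConstants empty-row constants are real HBase utilities. The idea is to clip the predicate-derived scan range to a split's region range in one place, treating an empty byte[] as unbounded on that side (the HBase convention):

import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: intersect the pushed-down rowkey range with a
// split's region range, so each task scans only its own region's slice.
public final class RowRange {
  // An empty byte[] means "unbounded" on that side, as in HBase scans.
  public final byte[] start;
  public final byte[] stop;

  public RowRange(byte[] start, byte[] stop) {
    // Normalize nulls to HBase's empty-row constants (empty byte arrays).
    this.start = (start == null) ? HConstants.EMPTY_START_ROW : start;
    this.stop = (stop == null) ? HConstants.EMPTY_END_ROW : stop;
  }

  // Intersection: the later of the two starts, the earlier of the two stops.
  public static RowRange intersect(RowRange a, RowRange b) {
    return new RowRange(laterOf(a.start, b.start), earlierOf(a.stop, b.stop));
  }

  private static byte[] laterOf(byte[] x, byte[] y) {
    if (x.length == 0) return y;               // x unbounded below: y wins
    if (y.length == 0) return x;               // y unbounded below: x wins
    return Bytes.compareTo(x, y) >= 0 ? x : y; // later start bounds the scan
  }

  private static byte[] earlierOf(byte[] x, byte[] y) {
    if (x.length == 0) return y;               // x unbounded above: y wins
    if (y.length == 0) return x;               // y unbounded above: x wins
    return Bytes.compareTo(x, y) <= 0 ? x : y; // earlier stop bounds the scan
  }
}

With something like this, getRecordReader() could compute the clipped range once:

RowRange clipped = RowRange.intersect(
    new RowRange(tableSplit.getStartRow(), tableSplit.getEndRow()),
    new RowRange(startRow, stopRow));
scan.setStartRow(clipped.start);
scan.setStopRow(clipped.stop);

which matches the intent of Gang Deng's patch while keeping the comparison logic out of the three call sites listed above.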