[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482075#comment-13482075 ]
Lianhui Wang commented on HIVE-3420:
------------------------------------

@Gang Deng Yes, I agree with you. In the InputFormat's getRecordReader(), the call

    tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
        getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
            jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));

has already done

    tableSplit = new TableSplit(
        tableSplit.getTableName(),
        startRow,
        stopRow,
        tableSplit.getRegionLocation(),
        tableSplit.getConf());

Also, in getSplits(), each TableSplit leads to one task on its region location. Right now those splits have no effect, so the startRow/stopRow set on a TableSplit should be kept inside that split's region row range.

IMO, the convertFilter() logic is duplicated in several places, for example:

    HBaseStorageHandler.decomposePredicate()
    HiveHBaseTableInputFormat.getSplits()
    HiveHBaseTableInputFormat.getRecordReader()

I think it should live in one place: HBaseStorageHandler.decomposePredicate(), which can store the row key ranges. Then HiveHBaseTableInputFormat.getSplits() and HiveHBaseTableInputFormat.getRecordReader() can split the key ranges into tasks according to the table's region info (see the sketch after the quoted description below). Does anyone else have ideas? Thanks.

> Inefficiency in HBase handler when processing queries including rowkey range scans
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Priority: Critical
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, the Hive map tasks do not leverage the startRow/endRow information in the TableSplit. For example, if the row keys fit into 5 HBase files, there will be 5 map tasks. Ideally, each task would process 1 file, but in the current implementation each task processes all 5 files repeatedly. This not only wastes network bandwidth, but also worsens the lock contention in the HBase block cache, since every task has to access the same blocks. The problem code is in HiveHBaseTableInputFormat.convertFilter, as below:
> ……
> if (tableSplit != null) {
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       startRow,
>       stopRow,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(startRow);
> scan.setStopRow(stopRow);
> ……
> As the TableSplit already includes the startRow/endRow information of the file, a better implementation would be:
> ……
> byte[] splitStart = startRow;
> byte[] splitStop = stopRow;
> if (tableSplit != null) {
>   if (tableSplit.getStartRow() != null) {
>     splitStart = startRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>   }
>   if (tableSplit.getEndRow() != null) {
>     splitStop = (stopRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>         tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>   }
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       splitStart,
>       splitStop,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(splitStart);
> scan.setStopRow(splitStop);
> ……
> In my test, the changed code improves performance by more than 30%.
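For reference, a minimal sketch of the consolidated range-intersection helper suggested in the comment above. The RowRange class and its intersect(), laterOf(), and earlierOf() methods are hypothetical names invented for illustration, not existing Hive or HBase code; only Bytes.compareTo() and the HConstants empty-row constants are real HBase utilities. The idea is to clip the predicate-derived scan range to a split's region range in one place, treating an empty byte[] as unbounded on that side (the HBase convention):

import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: intersect the pushed-down rowkey range with a
// split's region range, so each task scans only its own region's slice.
public final class RowRange {
  // An empty byte[] means "unbounded" on that side, as in HBase scans.
  public final byte[] start;
  public final byte[] stop;

  public RowRange(byte[] start, byte[] stop) {
    // Normalize nulls to HBase's empty-row constants (empty byte arrays).
    this.start = (start == null) ? HConstants.EMPTY_START_ROW : start;
    this.stop = (stop == null) ? HConstants.EMPTY_END_ROW : stop;
  }

  // Intersection: the later of the two starts, the earlier of the two stops.
  public static RowRange intersect(RowRange a, RowRange b) {
    return new RowRange(laterOf(a.start, b.start), earlierOf(a.stop, b.stop));
  }

  private static byte[] laterOf(byte[] x, byte[] y) {
    if (x.length == 0) return y;               // x unbounded below: y wins
    if (y.length == 0) return x;               // y unbounded below: x wins
    return Bytes.compareTo(x, y) >= 0 ? x : y; // later start bounds the scan
  }

  private static byte[] earlierOf(byte[] x, byte[] y) {
    if (x.length == 0) return y;               // x unbounded above: y wins
    if (y.length == 0) return x;               // y unbounded above: x wins
    return Bytes.compareTo(x, y) <= 0 ? x : y; // earlier stop bounds the scan
  }
}

With something like this, getRecordReader() could compute the clipped range once:

RowRange clipped = RowRange.intersect(
    new RowRange(tableSplit.getStartRow(), tableSplit.getEndRow()),
    new RowRange(startRow, stopRow));
scan.setStartRow(clipped.start);
scan.setStopRow(clipped.stop);

which matches the intent of Gang Deng's patch while keeping the comparison logic out of the three call sites listed above.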