[ https://issues.apache.org/jira/browse/HIVE-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762297#comment-13762297 ]
Angela Li commented on HIVE-4247:
---------------------------------

Could someone please raise the priority of this? Thanks!

> Filtering on a hbase row key duplicates results across multiple mappers
> -----------------------------------------------------------------------
>
>                 Key: HIVE-4247
>                 URL: https://issues.apache.org/jira/browse/HIVE-4247
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>         Environment: All Platforms
>            Reporter: Karthik Kumara
>              Labels: patch
>         Attachments: HiveHBaseTableInputFormat.patch
>
>
> Steps to reproduce:
> 1. Create a Hive external table with HiveHbaseHandler, with enough data in
> the hbase table to spawn multiple mappers for the hive query.
> 2. Write a query which has a filter (in the where clause) based on the hbase
> row key.
> 3. Run the map reduce job. Each mapper queries the entire filtered range
> instead of its own part of it, so the results are duplicated once per
> mapper.
> Expected behavior:
> Each mapper should process a different part of the data and should not
> duplicate results.
> Cause:
> The cause seems to be the convertFilter method in HiveHBaseTableInputFormat.
> convertFilter has this piece of code, which rewrites the start and the stop
> row for each split, which in turn leads each mapper to process the entire
> range:
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         startRow,
>         stopRow,
>         tableSplit.getRegionLocation());
>     }
> The scan already has the start and stop row set when the splits are created,
> so this piece of code is probably redundant.
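
To illustrate the cause described in the issue: if every split is given the full filter range, every mapper scans the same rows. One way to avoid that is to clip the filter range to each split's own boundaries. The following is only a minimal sketch of that idea, not the attached HiveHBaseTableInputFormat.patch (whose contents are not shown here). It uses plain byte arrays rather than HBase classes, the names SplitRangeSketch, clip and isEmpty are hypothetical, and it assumes the usual HBase conventions that row keys sort as unsigned bytes and that an empty start or stop row means "unbounded".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the attached patch and not HBase API.
final class SplitRangeSketch {

    static byte[] bytes(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    static String str(byte[] b) { return new String(b, StandardCharsets.UTF_8); }

    // Unsigned lexicographic comparison, the order in which HBase sorts row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Clip the filter range [filterStart, filterStop) to one split's range
    // [splitStart, splitStop). An empty array means "unbounded" on that side.
    static byte[][] clip(byte[] splitStart, byte[] splitStop,
                         byte[] filterStart, byte[] filterStop) {
        // An empty start sorts before every key, so the later start is the max.
        byte[] start = compare(splitStart, filterStart) >= 0 ? splitStart : filterStart;

        // An empty stop means "no upper bound", so any real bound wins over it.
        byte[] stop;
        if (splitStop.length == 0) {
            stop = filterStop;
        } else if (filterStop.length == 0) {
            stop = splitStop;
        } else {
            stop = compare(splitStop, filterStop) <= 0 ? splitStop : filterStop;
        }
        return new byte[][] { start, stop };
    }

    // True when the clipped range contains no rows, so the split can be skipped.
    static boolean isEmpty(byte[][] range) {
        return range[1].length != 0 && compare(range[0], range[1]) >= 0;
    }

    public static void main(String[] args) {
        // Two splits covering the whole table, and a filter on rows in [c, p).
        byte[][][] splits = { { bytes(""), bytes("m") }, { bytes("m"), bytes("") } };
        for (byte[][] split : splits) {
            byte[][] r = clip(split[0], split[1], bytes("c"), bytes("p"));
            System.out.println("[" + str(r[0]) + ", " + str(r[1]) + ")"
                    + (isEmpty(r) ? " (empty)" : ""));
        }
    }
}
```

With this kind of clipping, each mapper receives only the portion of the filtered range that falls inside its own split, and a split that does not overlap the filter at all yields an empty range.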