[jira] [Commented] (HIVE-4247) Filtering on a hbase row key duplicates results across multiple mappers

Angela Li (JIRA) Mon, 09 Sep 2013 14:14:21 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762297#comment-13762297
 ]


Angela Li commented on HIVE-4247:
---------------------------------

Could someone please raise the priority of this issue? Thanks!
                
> Filtering on a hbase row key duplicates results across multiple mappers
> -----------------------------------------------------------------------
>
>                 Key: HIVE-4247
>                 URL: https://issues.apache.org/jira/browse/HIVE-4247
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>         Environment: All Platforms
>            Reporter: Karthik Kumara
>              Labels: patch
>         Attachments: HiveHBaseTableInputFormat.patch
>
>
> Steps to reproduce:
> 1. Create a Hive external table with HiveHbaseHandler with enough data in the 
> HBase table to spawn multiple mappers for the Hive query.
> 2. Write a query that has a filter (in the WHERE clause) based on the HBase 
> row key. 
> 3. Run the MapReduce job. Each mapper queries the entire filtered range 
> instead of only its own split, so the results are duplicated once per 
> mapper.
> Expected behavior:
> Each mapper should process a different part of the data, and the results 
> should not contain duplicates.
> Cause:
> The cause seems to be the convertFilter method in HiveHBaseTableInputFormat. 
> convertFilter contains the following code, which overwrites the start and 
> stop row of each split, so every mapper processes the entire range:
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         startRow,
>         stopRow,
>         tableSplit.getRegionLocation());
>     }
> The scan already has the start and stop row set when the splits are created. 
> So this piece of code is probably redundant.
>  
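
To illustrate the expected behavior described above, here is a minimal sketch of how a per-split range could be derived by intersecting the filter range with each split's own range, rather than overwriting every split with the full filter range. This is not the attached patch and it does not use any HBase or Hive classes; KeyRange, intersect, laterStart, earlierStop and key are hypothetical names introduced only for this example, and an empty byte array is assumed to mean "unbounded".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitRangeSketch {

    /** Hypothetical stand-in for a row key range [start, stop); an empty array means unbounded. */
    static final class KeyRange {
        final byte[] start;
        final byte[] stop;

        KeyRange(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** Picks the later of two start keys; an empty start means "from the first row". */
    static byte[] laterStart(byte[] a, byte[] b) {
        if (a.length == 0) return b;
        if (b.length == 0) return a;
        return Arrays.compareUnsigned(a, b) >= 0 ? a : b;
    }

    /** Picks the earlier of two stop keys; an empty stop means "to the last row". */
    static byte[] earlierStop(byte[] a, byte[] b) {
        if (a.length == 0) return b;
        if (b.length == 0) return a;
        return Arrays.compareUnsigned(a, b) <= 0 ? a : b;
    }

    /** Returns the part of the split covered by the filter, or null if they do not overlap. */
    static KeyRange intersect(KeyRange split, KeyRange filter) {
        byte[] start = laterStart(split.start, filter.start);
        byte[] stop = earlierStop(split.stop, filter.stop);
        if (stop.length != 0 && Arrays.compareUnsigned(start, stop) >= 0) {
            return null; // empty range: this mapper has nothing to scan
        }
        return new KeyRange(start, stop);
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed example: a filter on row keys in [b, y) and a table with two splits.
        KeyRange filter = new KeyRange(key("b"), key("y"));
        KeyRange[] splits = {
            new KeyRange(key(""), key("m")),
            new KeyRange(key("m"), key("")),
        };
        for (KeyRange split : splits) {
            KeyRange r = intersect(split, filter);
            System.out.println(r == null
                ? "skip split"
                : new String(r.start, StandardCharsets.UTF_8) + " .. "
                    + new String(r.stop, StandardCharsets.UTF_8));
        }
    }
}
```

With the two example splits, the first mapper would scan [b, m) and the second [m, y), so no row is read twice. Replacing each split's range with the full [b, y) range, as the quoted code does, would instead make both mappers scan the same rows.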

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
