[ 
https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009938#comment-14009938
 ] 

Craig Condit commented on HIVE-1643:
------------------------------------

The patch as-is has a few issues...

First, at least in Hive 0.12, it interacts badly when multiple tables are 
joined. I've seen cases where it was clear that Hive was attempting to push 
down predicates for the wrong table, leading to NullPointerExceptions when the 
column is looked up and not found since the HBase storage handler assumes that 
any predicate that it receives will be for a valid column. I suspect this must 
be a bug in the query optimizer, but have not been able to determine exactly 
where.
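
Until the root cause is found, a defensive check in the handler would at least avoid the NullPointerException. The following is only a sketch in Python (the real handler is Java), and every name in it (predicates_for_table, the mapping dict, the tuple layout) is hypothetical rather than taken from the patch. It assumes predicates arrive as (column, operator, value) tuples and simply sets aside any predicate whose column is not mapped for this table:

```python
def predicates_for_table(predicates, column_mapping):
    """Split predicates into those on columns this table maps and the rest.

    Predicates on unmapped columns are returned separately so the caller
    can leave them for Hive to evaluate instead of failing on lookup.
    """
    known, unknown = [], []
    for pred in predicates:
        column = pred[0]
        if column in column_mapping:
            known.append(pred)
        else:
            unknown.append(pred)
    return known, unknown


# Hypothetical Hive-column -> HBase-column mapping for a single table.
mapping = {"rowkey": ":key", "name": "cf:name"}
preds = [("rowkey", ">=", "A"), ("other_table_col", "=", "1")]
known, unknown = predicates_for_table(preds, mapping)
```

This does not fix whatever sends the wrong predicates down; it only keeps the handler from assuming every column it receives is valid.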

Second, the fallback behavior when a complex query predicate is passed down is 
to punt on the entire expression, even if it could be partially evaluated (for 
example rowkey >= 'A' AND rowkey < 'B' AND ([complex bit])). This leads to 
unexpected full table scans in HBase. At the very least, the code should 
handle the rowkey parts whenever possible. The fallback is easy to trigger 
by accident: a single term using an operator that the storage handler has no 
case for is enough.
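
To illustrate the partial handling I have in mind, here is a minimal sketch in Python (not the handler's Java, and not code from the patch). The names Term, PUSHABLE_OPS and split_conjunction are made up for the example, and it assumes the predicate has already been flattened into its top-level AND terms. Rowkey comparisons with a supported operator are kept for the scan; everything else becomes a residual that Hive still evaluates:

```python
from dataclasses import dataclass

# Operators this hypothetical handler can turn into an HBase scan range.
PUSHABLE_OPS = {"=", ">=", ">", "<", "<="}


@dataclass(frozen=True)
class Term:
    """One top-level AND term of the form <column> <op> <value>."""
    column: str
    op: str
    value: str


def split_conjunction(terms, rowkey="rowkey"):
    """Split AND-ed terms into (pushed, residual).

    A term is pushed only if it compares the rowkey using a supported
    operator. Any other term goes to the residual instead of causing
    the whole expression to be abandoned.
    """
    pushed, residual = [], []
    for term in terms:
        if term.column == rowkey and term.op in PUSHABLE_OPS:
            pushed.append(term)
        else:
            residual.append(term)
    return pushed, residual


terms = [
    Term("rowkey", ">=", "A"),
    Term("rowkey", "<", "B"),
    Term("cf:col", "LIKE", "%x%"),  # stands in for the [complex bit]
]
pushed, residual = split_conjunction(terms)
```

With a split like this, the two rowkey terms could still bound the scan even though the third term uses an operator the handler does not know, and the result stays correct because the residual is applied afterwards.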

Third, even if the predicate pushdown works, it often causes secondary 
issues when interacting with HBase. When no rowkey expression exists, 
evaluating the filters can drive CPU usage on HBase very high, and can even 
cause HBase RPC timeouts if so many rows are filtered out that no data is 
returned quickly enough. It would be nice to be able to control (somehow) 
which expressions the code tries to push down.

At our location, we didn't even try to port the patch to Hive 0.13 when we 
upgraded, mainly due to issues #2 and #3. Fortunately, CTEs have allowed us to 
ensure that only rowkey predicates get pushed down like so:

{noformat}
with a as (select ... from hbase_table where rowkey >= 'start' and rowkey < 
'end') select * from a where ...;
{noformat}

It might be more useful for Hive-HBase integration to focus on ensuring that 
rowkey predicates are always pushed down (except for things like OR/NOT 
expressions, etc.) rather than trying to push down other types of expressions.



> support range scans and non-key columns in HBase filter pushdown
> ----------------------------------------------------------------
>
>                 Key: HIVE-1643
>                 URL: https://issues.apache.org/jira/browse/HIVE-1643
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>            Reporter: John Sichi
>            Assignee: bharath v
>              Labels: patch
>         Attachments: HIVE-1643.patch, Hive-1643.2.patch, hbase_handler.patch
>
>
> HIVE-1226 added support for WHERE rowkey=3.  We would like to support WHERE 
> rowkey BETWEEN 10 and 20, as well as predicates on non-rowkeys (plus 
> conjunctions etc).  Non-rowkey conditions can't be used to filter out entire 
> ranges, but they can be used to push the per-row filter processing as far 
> down as possible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
