[ 
https://issues.apache.org/jira/browse/HIVE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674426#comment-13674426
 ] 

Teddy Choi commented on HIVE-4642:
----------------------------------

Here is my draft spec. Please leave a comment.
----
A base version is easy to implement with the basic template and the UDFRegExp 
class, but it will be expensive and will need further optimization.

Problem: The regular expression matcher is roughly 10 or more times slower than 
the prefix/suffix matchers (as shown in HIVE-4548). Because the Pattern is 
already compiled, there is little room to optimize the Pattern itself. However, 
matches on different rows don't depend on each other, so they can be 
distributed over threads. The base version also creates new objects on every 
call, which can be avoided.

Goal: Reduce object creation per call, and distribute the matching load over 
multiple threads.

Cache and reuse a compiled pattern, a byte buffer, a char buffer, and a UTF-8 
decoder, as in HIVE-4548.
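
A minimal sketch of that caching idea, assuming the input arrives as UTF-8 
bytes with an offset and a length. The class name CachedRegexMatcher and its 
find method are hypothetical and are not taken from Hive or HIVE-4548. It also 
assumes RLIKE means "match anywhere in the string", hence Matcher.find().

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Hive class): one Matcher, one ByteBuffer, one
// CharBuffer and one CharsetDecoder are created once and reused per call.
// Not thread-safe; each thread would need its own instance.
class CachedRegexMatcher {
  private final Matcher matcher;
  private final CharsetDecoder decoder;
  private ByteBuffer byteBuf = ByteBuffer.allocate(256);
  private CharBuffer charBuf = CharBuffer.allocate(256);

  CachedRegexMatcher(String regex) {
    // Compile once; the Matcher is re-pointed at new input with reset().
    this.matcher = Pattern.compile(regex).matcher("");
    this.decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
  }

  boolean find(byte[] bytes, int start, int length) {
    // Grow the buffers only when an input is larger than anything seen so
    // far. UTF-8 never decodes to more chars than it has bytes.
    if (byteBuf.capacity() < length) {
      int newSize = Math.max(length, byteBuf.capacity() * 2);
      byteBuf = ByteBuffer.allocate(newSize);
      charBuf = CharBuffer.allocate(newSize);
    }
    byteBuf.clear();
    byteBuf.put(bytes, start, length);
    byteBuf.flip();

    charBuf.clear();
    decoder.reset();
    decoder.decode(byteBuf, charBuf, true);
    decoder.flush(charBuf);
    charBuf.flip();

    // CharBuffer is a CharSequence, so no String is created here.
    return matcher.reset(charBuf).find();
  }
}
```

The copy into byteBuf trades a memory copy for an allocation; whether that is 
a net win would have to be measured.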

Divide the matching tasks into groups and run each group on a different 
thread, or apply the producer-consumer pattern. If there are enough idle CPU 
cores, total execution time should be reduced significantly.
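
A rough sketch of the grouping variant, assuming the values are already 
available as Java strings and that a thread pool is supplied by the caller. 
ParallelRegexFilter and its filter method are hypothetical names, and the 
sketch uses Java 8 lambdas for brevity. Pattern is safe to share between 
threads, but Matcher is not, so each group creates its own Matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Hive class): splits the rows into contiguous
// groups and matches each group on a pool thread.
class ParallelRegexFilter {
  static boolean[] filter(Pattern pattern, String[] values, int groups,
                          ExecutorService pool) {
    boolean[] selected = new boolean[values.length];
    // Ceiling division so that at most `groups` tasks are submitted.
    int chunk = (values.length + groups - 1) / groups;
    List<Future<?>> futures = new ArrayList<>();
    for (int from = 0; from < values.length; from += chunk) {
      final int start = from;
      final int end = Math.min(from + chunk, values.length);
      futures.add(pool.submit(() -> {
        // One Matcher per task, reused for every row in the group.
        Matcher m = pattern.matcher("");
        for (int i = start; i < end; i++) {
          selected[i] = m.reset(values[i]).find();
        }
      }));
    }
    // Waiting on each Future also makes the writes to `selected` visible.
    for (Future<?> f : futures) {
      try {
        f.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      } catch (ExecutionException e) {
        throw new IllegalStateException(e.getCause());
      }
    }
    return selected;
  }
}
```

Each group writes to a disjoint range of the result array, so no locking is 
needed. For small batches the task submission overhead may outweigh the gain.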

If feasible, implement prefix/suffix matchers as a further optimization. 
People will probably use the LIKE filter for simple filtering, so these 
matchers may not be used often, but they will run faster when they apply.
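
As an illustration of why such matchers are cheaper, the sketch below compares 
UTF-8 bytes directly and skips decoding entirely. The class 
BytePrefixSuffixMatcher is hypothetical, and the sketch assumes the literal 
prefix or suffix has already been extracted from the expression; how to 
recognize such expressions is not covered here.

```java
// Hypothetical helper (not a Hive class): prefix and suffix checks on raw
// UTF-8 bytes. For valid UTF-8 a byte-wise match is also a character-wise
// match, so no decoder or Matcher is involved.
class BytePrefixSuffixMatcher {
  static boolean startsWith(byte[] bytes, int start, int length, byte[] prefix) {
    if (length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if (bytes[start + i] != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  static boolean endsWith(byte[] bytes, int start, int length, byte[] suffix) {
    if (length < suffix.length) {
      return false;
    }
    // First byte of the region that must equal the suffix.
    int offset = start + length - suffix.length;
    for (int i = 0; i < suffix.length; i++) {
      if (bytes[offset + i] != suffix[i]) {
        return false;
      }
    }
    return true;
  }
}
```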
                
> Implement vectorized RLIKE and REGEXP filter expressions
> --------------------------------------------------------
>
>                 Key: HIVE-4642
>                 URL: https://issues.apache.org/jira/browse/HIVE-4642
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Teddy Choi
>
> See title. I will add more details next week. The goal is (a) make this work 
> correctly and (b) optimize it as well as possible, at least for the common 
> cases.

