[ https://issues.apache.org/jira/browse/HIVE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674426#comment-13674426 ]
Teddy Choi commented on HIVE-4642:
----------------------------------

Here is my draft spec. Please leave a comment.

----

The base version can be implemented easily with the basic template and the UDFRegExp class. It will be expensive, though, and needs further optimization.

Problem: The regular expression matcher is roughly 10 or more times slower than the prefix/suffix matchers (as shown in HIVE-4548). Because the Pattern is already compiled, it is hard to optimize the Pattern itself any further. Match operations are independent of each other, so they can be distributed over threads. The base version also creates new objects on every call, which can be done more efficiently.

Goal: Reduce object creation per call, and distribute the matching load over multiple threads.

Cache and reuse a compiled pattern, a byte buffer, a char buffer, and a UTF-8 decoder, as in HIVE-4548.

Divide the matching tasks into groups and run each group on a different thread, or apply the producer-consumer pattern. If there are enough idle CPU cores, total execution time will be reduced significantly.

If it is feasible, implement prefix/suffix matchers for further optimization. People may prefer the LIKE filter for simpler filtering, so these matchers may not be used frequently, but they will run faster.

> Implement vectorized RLIKE and REGEXP filter expressions
> --------------------------------------------------------
>
>                 Key: HIVE-4642
>                 URL: https://issues.apache.org/jira/browse/HIVE-4642
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Teddy Choi
>
> See title. I will add more details next week. The goal is (a) make this work
> correctly and (b) optimize it as well as possible, at least for the common
> cases.
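
To make the "cache and reuse" goal of the draft spec concrete, here is a minimal sketch, not the actual Hive implementation. The class name ReusableRegexMatcher and its find method are hypothetical, and it assumes RLIKE should succeed when the pattern is found anywhere in the value. It compiles the pattern once and reuses one Matcher, one UTF-8 decoder, and one char buffer across calls.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Hive): matches UTF-8 byte slices against one regex. */
public class ReusableRegexMatcher {
  private final Matcher matcher;         // built from a Pattern compiled once, reused via reset()
  private final CharsetDecoder decoder;  // reused UTF-8 decoder
  private CharBuffer chars;              // reused decode target, grown only when needed

  public ReusableRegexMatcher(String regex) {
    this.matcher = Pattern.compile(regex).matcher("");
    this.decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    this.chars = CharBuffer.allocate(64);
  }

  /** Returns true if the regex is found anywhere in bytes[start, start + length). */
  public boolean find(byte[] bytes, int start, int length) {
    // wrap() creates a small view object but does not copy the bytes.
    ByteBuffer in = ByteBuffer.wrap(bytes, start, length);
    // UTF-8 never decodes to more chars than it has bytes, so `length` chars suffice.
    if (chars.capacity() < length) {
      chars = CharBuffer.allocate(length);
    }
    chars.clear();
    decoder.reset();
    decoder.decode(in, chars, true);
    decoder.flush(chars);
    chars.flip();
    // Re-point the existing Matcher at the new input instead of creating a new one.
    matcher.reset(chars);
    return matcher.find();
  }
}
```

Pattern is safe to share between threads, but Matcher and CharsetDecoder are not, so under the multi-threaded plan each worker thread would need its own instance of a helper like this.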