[ 
https://issues.apache.org/jira/browse/HIVE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660750#comment-13660750
 ] 

Teddy Choi commented on HIVE-4548:
----------------------------------

I edited FilterStringColLikeStringScalar.java as [~ehans] suggested.

For non-complex patterns, it calls a static method that calls no other 
methods and works only on the byte arrays it is given. For complex patterns, 
it reuses a ByteBuffer and a CharBuffer when decoding UTF-8, to avoid 
constructing new objects.
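
To illustrate the two paths, here is a minimal sketch. It is not the code in the patch: the class name LikeSketch and the methods startsWith and decodeReusing are hypothetical, and only standard java.nio classes are assumed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the class from the patch.
final class LikeSketch {

    // Simple path (abc%): compare UTF-8 bytes in place.
    // No allocation and no calls to other methods.
    static boolean startsWith(byte[] bytes, int start, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (bytes[start + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Complex path: the decoder and both buffers are created once and reused per row.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private ByteBuffer byteBuffer = ByteBuffer.allocate(256);
    private CharBuffer charBuffer = CharBuffer.allocate(256);

    // Decodes bytes[start, start + len) into the shared CharBuffer.
    // The result is only valid until the next call.
    CharBuffer decodeReusing(byte[] bytes, int start, int len) {
        if (byteBuffer.capacity() < len) {
            // Grow only when a row is longer than anything seen so far.
            // UTF-8 never decodes to more chars than it has bytes.
            byteBuffer = ByteBuffer.allocate(len);
            charBuffer = CharBuffer.allocate(len);
        }
        byteBuffer.clear();
        byteBuffer.put(bytes, start, len);
        byteBuffer.flip();
        charBuffer.clear();
        decoder.reset();
        decoder.decode(byteBuffer, charBuffer, true);
        decoder.flush(charBuffer);
        charBuffer.flip();
        return charBuffer;
    }
}
```

Byte-wise comparison is safe for the prefix case because a valid UTF-8 literal can only match at a character boundary of a valid UTF-8 string. CharBuffer implements CharSequence, so the decoded buffer could be handed to a regex matcher without building a String first.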

In the run below, every case is faster, by about 40% to 170%. The benchmark 
output follows.

{noformat}
test:
     [echo] Project: ql
    [junit] WARNING: multiple versions of ant detected in path for junit 
    [junit]          jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
    [junit]      and jar:file:/Users/pudidic/IdeaProjects/hive/build/ivy/lib/hadoop0.20S.shim/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
    [junit] Running org.apache.hadoop.hive.ql.exec.vector.expressions.TestFilterStringColLikeStringScalar
    [junit] ----
    [junit] mix%
    [junit] new 1077ms.
    [junit] old 2908ms.
    [junit] 170.00928%  faster
    [junit] ----
    [junit] %Up
    [junit] new 1008ms.
    [junit] old 2244ms.
    [junit] 122.61906%  faster
    [junit] ----
    [junit] %dU%
    [junit] new 1792ms.
    [junit] old 3350ms.
    [junit] 86.94197%   faster
    [junit] ----
    [junit] m%dU%
    [junit] new 17290ms.
    [junit] old 24224ms.
    [junit] 40.104103%  faster
    [junit] ----
    [junit] mixedUp
    [junit] new 1347ms.
    [junit] old 2907ms.
    [junit] 115.81292%  faster
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 58.683 sec

BUILD SUCCESSFUL
Total time: 1 minute 57 seconds
{noformat}
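
For anyone reading the numbers: the "% faster" figures are consistent with (old / new - 1) * 100. That formula is inferred from the output above, not taken from the benchmark source, and the names SpeedupSketch and percentFaster are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper reproducing the "% faster" arithmetic from the log.
final class SpeedupSketch {

    // Percentage by which the new time beats the old one.
    static double percentFaster(long oldMs, long newMs) {
        return ((double) oldMs / newMs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Timings copied from the mix% case in the log.
        System.out.println(String.format(Locale.ROOT, "%.2f%% faster", percentFaster(2908, 1077)));
    }
}
```

With the mix% timings (old 2908 ms, new 1077 ms) this gives about 170.01, which agrees with the logged 170.00928 up to floating-point precision.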

It could be made more efficient still with a template-driven approach. I'll 
apply that soon.
                
> Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%
> ----------------------------------------------------------------------
>
>                 Key: HIVE-4548
>                 URL: https://issues.apache.org/jira/browse/HIVE-4548
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: vectorization-branch
>            Reporter: Eric Hanson
>            Assignee: Teddy Choi
>            Priority: Minor
>             Fix For: vectorization-branch
>
>         Attachments: HIVE-4548.1-with-benchmark.patch.txt, 
> HIVE-4548.1-without-benchmark.patch.txt
>
>
> Speed up vectorized LIKE filter evaluation for abc%, %abc, and %abc% pattern 
> special cases (here, abc is just a placeholder for some fixed string).  
>   
> Problem: The current vectorized LIKE implementation always calls the standard 
> LIKE function code in UDFLike.java. But this is pretty expensive. It calls 
> multiple functions and allocates at least one new object per call. Probably 
> 80% of uses of LIKE are for the simple patterns abc%, %abc, and %abc%.  These 
> can be implemented much more efficiently.
> Start by speeding up the case for  
>     Column LIKE "abc%"
>   
> The goal would be to minimize expense in the inner loop. Don't use new() in 
> the inner loop, and write a static function that checks the prefix of the 
> string matches the like pattern as efficiently as possible, operating 
> directly on the byte array holding UTF-8-encoded string data, and avoiding 
> unnecessary additional function calls and if/else logic. Call that in the 
> inner loop.
> If feasible, consider using a template-driven approach, with an instance of 
> the template expanded for each of the three cases. Start doing the abc% 
> (prefix match) by hand, then consider templatizing for the other two cases.
> The code is in the "vectorization" branch of the main hive repo.
>   
> Start by checking in the constructor for FilterStringColLikeStringScalar.java 
> if the pattern is one of the simple special cases. If so, record that, and 
> have the evaluate() method call a special-case function for each case, i.e. 
> the general case, and each of the 3 special cases. All the dynamic 
> decision-making would be done once per vector, not once per element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
