[ https://issues.apache.org/jira/browse/HIVE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660750#comment-13660750 ]
Teddy Choi commented on HIVE-4548:
----------------------------------

I edited FilterStringColLikeStringScalar.java as [~ehans] described. For non-complex patterns, it calls a static method that calls no other methods and uses only the byte arrays it is given. For complex patterns, it reuses a ByteBuffer and a CharBuffer when decoding UTF-8, to avoid constructing new objects. There is a 30%~170% performance improvement in all cases. The benchmark result is attached.

{noformat}
test:
     [echo] Project: ql
    [junit] WARNING: multiple versions of ant detected in path for junit
    [junit]          jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
    [junit]      and jar:file:/Users/pudidic/IdeaProjects/hive/build/ivy/lib/hadoop0.20S.shim/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
    [junit] Running org.apache.hadoop.hive.ql.exec.vector.expressions.TestFilterStringColLikeStringScalar
    [junit] ----
    [junit] mix%
    [junit] new 1077ms.
    [junit] old 2908ms.
    [junit] 170.00928% faster
    [junit] ----
    [junit] %Up
    [junit] new 1008ms.
    [junit] old 2244ms.
    [junit] 122.61906% faster
    [junit] ----
    [junit] %dU%
    [junit] new 1792ms.
    [junit] old 3350ms.
    [junit] 86.94197% faster
    [junit] ----
    [junit] m%dU%
    [junit] new 17290ms.
    [junit] old 24224ms.
    [junit] 40.104103% faster
    [junit] ----
    [junit] mixedUp
    [junit] new 1347ms.
    [junit] old 2907ms.
    [junit] 115.81292% faster
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 58.683 sec

BUILD SUCCESSFUL
Total time: 1 minute 57 seconds
{noformat}

It can still be made more efficient with a template-driven approach. I'll apply it soon.
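For readers who want to see what the buffer reuse for complex patterns might look like, here is a minimal sketch. It is not the code in the attached patch. The class name ReusableUtf8Matcher, the initial buffer size of 64, and the use of a java.util.regex Matcher for the final comparison are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes UTF-8 into reused buffers and matches a regex.
final class ReusableUtf8Matcher {
    // One decoder per instance; malformed input becomes U+FFFD instead of throwing.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private ByteBuffer byteBuffer = ByteBuffer.allocate(64);
    private CharBuffer charBuffer = CharBuffer.allocate(64);
    private final Matcher matcher;

    ReusableUtf8Matcher(String regex) {
        // The Matcher is created once and reset for every value.
        this.matcher = Pattern.compile(regex, Pattern.DOTALL).matcher("");
    }

    boolean matches(byte[] bytes, int start, int len) {
        // Grow the buffers only when a value is larger than anything seen so far.
        // UTF-8 never decodes to more chars than it has bytes, so equal capacity is enough.
        if (byteBuffer.capacity() < len) {
            byteBuffer = ByteBuffer.allocate(len * 2);
            charBuffer = CharBuffer.allocate(len * 2);
        }
        byteBuffer.clear();
        byteBuffer.put(bytes, start, len);
        byteBuffer.flip();

        charBuffer.clear();
        decoder.reset();
        decoder.decode(byteBuffer, charBuffer, true);
        decoder.flush(charBuffer);
        charBuffer.flip();

        // CharBuffer is a CharSequence, so no String is built here.
        matcher.reset(charBuffer);
        return matcher.matches();
    }
}
```

In the common case where values fit in the existing buffers, a call to matches() in this sketch allocates no ByteBuffer, CharBuffer, or String of its own.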
> Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%
> ----------------------------------------------------------------------
>
>                 Key: HIVE-4548
>                 URL: https://issues.apache.org/jira/browse/HIVE-4548
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: vectorization-branch
>            Reporter: Eric Hanson
>            Assignee: Teddy Choi
>            Priority: Minor
>             Fix For: vectorization-branch
>
>         Attachments: HIVE-4548.1-with-benchmark.patch.txt, HIVE-4548.1-without-benchmark.patch.txt
>
>
> Speed up vectorized LIKE filter evaluation for abc%, %abc, and %abc% pattern
> special cases (here, abc is just a place holder for some fixed string).
>
> Problem: The current vectorized LIKE implementation always calls the standard
> LIKE function code in UDFLike.java. But this is pretty expensive. It calls
> multiple functions and allocates at least one new object per call. Probably
> 80% of uses of LIKE are for the simple patterns abc%, %abc, and %abc%. These
> can be implemented much more efficiently.
>
> Start by speeding up the case for
> Column LIKE "abc%"
>
> The goal would be to minimize expense in the inner loop. Don't use new() in
> the inner loop, and write a static function that checks the prefix of the
> string matches the like pattern as efficiently as possible, operating
> directly on the byte array holding UTF-8-encoded string data, and avoiding
> unnecessary additional function calls and if/else logic. Call that in the
> inner loop.
>
> If feasible, consider using a template-driven approach, with an instance of
> the template expanded for each of the three cases. Start doing the abc%
> (prefix match) by hand, then consider templatizing for the other two cases.
>
> The code is in the "vectorization" branch of the main hive repo.
>
> Start by checking in the constructor for FilterStringColLikeStringScalar.java
> if the pattern is one of the simple special cases. If so, record that, and
> have the evaluate() method call a special-case function for each case, i.e.
> the general case, and each of the 3 special cases. All the dynamic
> decision-making would be done once per vector, not once per element.
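The approach described in the issue (classify the pattern once, then run a tight loop over raw UTF-8 bytes) could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the class SimpleLikeSketch, its Kind enum, and the plain arrays standing in for a Hive column vector are all hypothetical, and nulls, repeating values, and an already-selected batch are ignored. Byte-wise comparison is safe here because a valid UTF-8 sequence cannot occur at a misaligned position inside other valid UTF-8 data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a LIKE filter with per-pattern special cases.
final class SimpleLikeSketch {
    enum Kind { PREFIX, SUFFIX, CONTAINS, COMPLEX }

    final Kind kind;
    final byte[] literal;   // UTF-8 bytes of the fixed part; null for COMPLEX

    SimpleLikeSketch(String pattern) {
        // Decided once, in the constructor, as the issue suggests.
        kind = classify(pattern);
        if (kind == Kind.COMPLEX) {
            literal = null;
        } else {
            int from = pattern.startsWith("%") ? 1 : 0;
            int to = pattern.endsWith("%") ? pattern.length() - 1 : pattern.length();
            literal = pattern.substring(from, to).getBytes(StandardCharsets.UTF_8);
        }
    }

    static Kind classify(String p) {
        int n = p.length();
        if (p.indexOf('_') >= 0 || p.indexOf('\\') >= 0) return Kind.COMPLEX;
        boolean lead = n > 0 && p.charAt(0) == '%';
        boolean trail = n > 1 && p.charAt(n - 1) == '%';
        int from = lead ? 1 : 0;
        int to = trail ? n - 1 : n;
        if (from >= to) return Kind.COMPLEX;                              // "%" or "%%"
        if (p.substring(from, to).indexOf('%') >= 0) return Kind.COMPLEX; // inner wildcard
        if (lead && trail) return Kind.CONTAINS;
        if (lead) return Kind.SUFFIX;
        if (trail) return Kind.PREFIX;
        return Kind.COMPLEX;                                              // no wildcard; left to general path
    }

    static boolean startsWith(byte[] s, int start, int len, byte[] lit) {
        if (len < lit.length) return false;
        for (int i = 0; i < lit.length; i++) {
            if (s[start + i] != lit[i]) return false;
        }
        return true;
    }

    static boolean endsWith(byte[] s, int start, int len, byte[] lit) {
        if (len < lit.length) return false;
        int off = start + len - lit.length;
        for (int i = 0; i < lit.length; i++) {
            if (s[off + i] != lit[i]) return false;
        }
        return true;
    }

    static boolean contains(byte[] s, int start, int len, byte[] lit) {
        int last = start + len - lit.length;
        for (int i = start; i <= last; i++) {
            int j = 0;
            while (j < lit.length && s[i + j] == lit[j]) j++;
            if (j == lit.length) return true;
        }
        return false;
    }

    // Writes the indexes of matching rows into selected and returns how many matched.
    // The switch runs once per vector; each inner loop only calls one static method.
    int filter(byte[][] vector, int[] start, int[] length, int n, int[] selected) {
        int out = 0;
        switch (kind) {
            case PREFIX:
                for (int i = 0; i < n; i++) {
                    if (startsWith(vector[i], start[i], length[i], literal)) selected[out++] = i;
                }
                break;
            case SUFFIX:
                for (int i = 0; i < n; i++) {
                    if (endsWith(vector[i], start[i], length[i], literal)) selected[out++] = i;
                }
                break;
            case CONTAINS:
                for (int i = 0; i < n; i++) {
                    if (contains(vector[i], start[i], length[i], literal)) selected[out++] = i;
                }
                break;
            default:
                throw new UnsupportedOperationException("general LIKE path not shown in this sketch");
        }
        return out;
    }
}
```

The three loops are deliberately near-identical copies, which is the kind of repetition a template-driven approach would generate rather than maintain by hand.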