-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11222/#review20877
-----------------------------------------------------------

ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/FilterStringColLikeStringScalar.java
<https://reviews.apache.org/r/11222/#comment42993>

    I think this code will fail for supplementary characters, i.e. code
    points above U+FFFF. A Java char is 16 bits, so a String represents
    each supplementary character as a surrogate pair (2 char values). If
    the code treats the second char of a surrogate pair as a character in
    its own right (for example, when matching '_'), it will fail.

    The Text class has a charAt() that returns an int (the 32-bit Unicode
    value). Maybe you could use that. Otherwise you may have to look for
    the lead (start) bytes of UTF-8 sequences.

    See
    http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
    http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
    http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html


ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorStringExpressions.java
<https://reviews.apache.org/r/11222/#comment42994>

    Please add unit tests with strings that contain multi-byte characters,
    including ones outside the 16-bit range 0x0000 to 0xFFFF.


Overall this looks good, but I think there is a functional issue with
characters outside the 16-bit Unicode range. See the inline comments.

- Eric Hanson


On May 21, 2013, 12:14 p.m., Teddy Choi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11222/
> -----------------------------------------------------------
> 
> (Updated May 21, 2013, 12:14 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> I edited FilterStringColLikeStringScalar.java as Eric Hanson suggested.
> 
> For non-complex patterns, it calls a static method that doesn't call others
> and uses its given byte arrays only. For complex patterns, it reuses a
> ByteBuffer and a CharBuffer for decoding UTF-8 to avoid object constructions.
> 
> There is a 30%~170% performance improvement in all cases.
> The benchmark results are at
> https://issues.apache.org/jira/browse/HIVE-4548#comment-13660750.
> 
> It can still be made more efficient by using a template-driven approach. I'll
> apply it soon.
> 
> 
> This addresses bug HIVE-4548.
>     https://issues.apache.org/jira/browse/HIVE-4548
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/FilterStringColLikeStringScalar.java 24ba861
>   ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorStringExpressions.java 6e26412
> 
> Diff: https://reviews.apache.org/r/11222/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Teddy Choi
> 
>
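
A minimal sketch of the UTF-8 lead-byte idea from the review comment on FilterStringColLikeStringScalar.java, assuming well-formed UTF-8 input. The class Utf8CharCount and its methods are hypothetical names used only for illustration; they are not taken from the patch or from Hive. The sketch shows that one supplementary character occupies 2 Java chars and 4 UTF-8 bytes, and that counting non-continuation bytes gives the character count without decoding.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Hive): counts characters in UTF-8 bytes. */
final class Utf8CharCount {

    /** True if b starts a UTF-8 sequence, i.e. it is not a 10xxxxxx continuation byte. */
    static boolean isLeadByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    /** Counts code points in bytes[start, start + length), assuming well-formed UTF-8. */
    static int countCodePoints(byte[] bytes, int start, int length) {
        int count = 0;
        for (int i = start; i < start + length; i++) {
            if (isLeadByte(bytes[i])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (a supplementary character), then "b".
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(s.length());                            // 4 chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));       // 3 code points
        System.out.println(utf8.length);                           // 6 bytes
        System.out.println(countCodePoints(utf8, 0, utf8.length)); // 3 characters
    }
}
```

A LIKE pattern such as 'a_b' should match this string, because '_' stands for exactly one character. A matcher that advances one char or one byte per '_' would see 2 chars or 4 bytes in the middle and reject it, which is the kind of case the requested unit tests could cover.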