> On May 22, 2013, midnight, Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/FilterStringColLikeStringScalar.java, line 227
> > <https://reviews.apache.org/r/11222/diff/2/?file=295212#file295212line227>
> >
> >     I think this code will fail for multi-byte characters beyond the
> >     standard Unicode code points. char is 16 bits. String in Java represents
> >     supplementary characters by surrogate pairs (2 char values). If the 2nd
> >     char of a surrogate pair is the same as, say, '_' then this will fail.
> >
> >     The Text class has a charAt() that returns an int (32 bit Unicode
> >     value). Maybe you could use that. Otherwise you may have to look for start
> >     byte characters in UTF-8.
> >
> >     See http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
> >     http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
> >     http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html
> >
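
To make the concern above concrete, here is a minimal sketch using only the JDK. The class and method names (`WildcardWidth`, `nextCodePointStart`, `nextCharStart`) are hypothetical and not taken from the patch. It assumes the practical risk is that a `_` wildcard, which should match one character, advances by a single char or a single byte and so stops in the middle of a surrogate pair or a multi-byte UTF-8 sequence. (A low surrogate itself lies in U+DC00..U+DFFF, so it cannot compare equal to '_'.)

```java
// Hypothetical helper for illustration only; not part of Hive or Hadoop.
final class WildcardWidth {

    // In UTF-8, continuation bytes always have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Given the index of a lead byte, returns the index of the next
    // character's lead byte (or utf8.length at the end of the input).
    static int nextCharStart(byte[] utf8, int pos) {
        int next = pos + 1;
        while (next < utf8.length && isContinuation(utf8[next])) {
            next++;
        }
        return next;
    }

    // Same idea for a Java String: skip one whole code point, which is
    // two chars when the character is stored as a surrogate pair.
    static int nextCodePointStart(String s, int index) {
        return index + Character.charCount(s.codePointAt(index));
    }
}
```

For the string "a", U+1F600, "b", `String.length()` is 4 and the UTF-8 encoding is 6 bytes, so a wildcard that steps by one char or one byte would not line up with character boundaries. Stepping with either method above keeps the match position on a boundary.
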
Thank you for your review. I wasn't aware of this problem. I will try the Text class first to solve it. This method was copied from the UDFLike class, so UDFLike has the same problem. I'll consider a reusable method for both classes.

- Teddy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11222/#review20877
-----------------------------------------------------------


On May 21, 2013, 12:14 p.m., Teddy Choi wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11222/
> -----------------------------------------------------------
>
> (Updated May 21, 2013, 12:14 p.m.)
>
>
> Review request for hive.
>
>
> Description
> -------
>
> I edited FilterStringColLikeStringScalar.java as Eric Hanson suggested.
>
> For non-complex patterns, it calls a static method that doesn't call others
> and uses only the byte arrays it is given. For complex patterns, it reuses a
> ByteBuffer and a CharBuffer for decoding UTF-8 to avoid object construction.
>
> Performance improves by 30% to 170% in all cases. The benchmark results
> are at https://issues.apache.org/jira/browse/HIVE-4548#comment-13660750.
>
> It can still be made more efficient by using a template-driven approach. I'll
> apply that soon.
>
>
> This addresses bug HIVE-4548.
>     https://issues.apache.org/jira/browse/HIVE-4548
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/FilterStringColLikeStringScalar.java 24ba861
>   ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorStringExpressions.java 6e26412
>
> Diff: https://reviews.apache.org/r/11222/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Teddy Choi
>
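
The buffer reuse mentioned in the review request description (one ByteBuffer and one CharBuffer kept across rows for UTF-8 decoding) might look roughly like the sketch below. This is not the code from the patch: the class name `ReusableUtf8Decoder`, the initial capacity of 64, and the choice to copy the input into the reused ByteBuffer are all assumptions made for illustration. It uses `StandardCharsets`, which requires Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; names are not from the patch.
final class ReusableUtf8Decoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private ByteBuffer bytesIn = ByteBuffer.allocate(64);   // assumed initial size
    private CharBuffer charsOut = CharBuffer.allocate(64);  // assumed initial size

    // Decodes bytes[start, start + length) into the reused CharBuffer.
    // The returned buffer is overwritten by the next call.
    CharBuffer decode(byte[] bytes, int start, int length) {
        // Grow only when a value is larger than anything seen so far.
        if (bytesIn.capacity() < length) {
            bytesIn = ByteBuffer.allocate(length);
        }
        // UTF-8 never decodes to more chars than it has bytes.
        if (charsOut.capacity() < length) {
            charsOut = CharBuffer.allocate(length);
        }
        bytesIn.clear();
        bytesIn.put(bytes, start, length);
        bytesIn.flip();
        charsOut.clear();
        decoder.reset();
        try {
            CoderResult result = decoder.decode(bytesIn, charsOut, true);
            if (result.isError()) {
                result.throwException();
            }
            result = decoder.flush(charsOut);
            if (result.isError()) {
                result.throwException();
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
        charsOut.flip();
        return charsOut;
    }
}
```

The trade-off in this sketch is one array copy per value in exchange for no per-row allocation once the buffers have grown to the largest value. Callers must finish with the returned CharBuffer before calling `decode` again.
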