[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973566#comment-13973566 ]
Jason Dere commented on HIVE-6843:
----------------------------------
Should this also work for Unicode characters that require more than one Java
char, i.e. supplementary characters such as U+10400, which Java encodes as the
surrogate pair \uD801\uDC00? If you add these checks to TestGenericUDFUtils,
the 2nd check fails:
{code}
Assert.assertEquals(3, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("\uD801\uDC00"), 0));
Assert.assertEquals(4, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("4"), 0));
{code}
This would require applying String.codePointCount() to the indexOf() result, so
that the returned position counts Unicode code points rather than UTF-16 chars.
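
A rough sketch of that idea, not the actual patch: it works on plain
java.lang.String instead of Hadoop's Text so that it stays self-contained, and
the class name CodePointIndex and method indexOfCodePoints are hypothetical.
The start offset is assumed to be given in code points as well.

```java
public final class CodePointIndex {
    private CodePointIndex() {}

    /**
     * Hypothetical helper: returns the 0-based position of needle in haystack,
     * counted in Unicode code points, or -1 if it is not found.
     * startCodePoint is also interpreted as a code point offset (assumption).
     */
    public static int indexOfCodePoints(String haystack, String needle, int startCodePoint) {
        int totalCodePoints = haystack.codePointCount(0, haystack.length());
        if (startCodePoint < 0 || startCodePoint > totalCodePoints) {
            return -1;
        }
        // Translate the code point offset into a UTF-16 char index for indexOf().
        int startChar = haystack.offsetByCodePoints(0, startCodePoint);
        int charIndex = haystack.indexOf(needle, startChar);
        if (charIndex < 0) {
            return -1;
        }
        // Translate the UTF-16 char index back into a code point position.
        return haystack.codePointCount(0, charIndex);
    }
}
```

With the strings from the checks above, indexOf() finds "4" at char index 5,
and codePointCount(0, 5) turns that into 4 because the surrogate pair counts
as a single code point.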
> INSTR for UTF-8 returns incorrect position
> ------------------------------------------
>
> Key: HIVE-6843
> URL: https://issues.apache.org/jira/browse/HIVE-6843
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Affects Versions: 0.11.0, 0.12.0
> Reporter: Clif Kranish
> Assignee: Szehon Ho
> Priority: Minor
> Attachments: HIVE-6843.patch
>
>