[ https://issues.apache.org/jira/browse/HIVE-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921983#comment-17921983 ]
Ryu Kobayashi commented on HIVE-27370:
--------------------------------------

I made some small fixes, so I created a new PR.
https://github.com/apache/hive/pull/5624

> SUBSTR UDF return '?' against 4-bytes character
> -----------------------------------------------
>
>                 Key: HIVE-27370
>                 URL: https://issues.apache.org/jira/browse/HIVE-27370
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Ryu Kobayashi
>            Assignee: Ryu Kobayashi
>            Priority: Major
>              Labels: pull-request-available
>
> SUBSTR doesn't seem to support 4-byte characters. This also happens on the
> master branch. It does not happen in vectorized mode, so the problem is
> specific to non-vectorized mode. An example is below:
> {code:java}
> -- vectorized mode
> create temporary table foo (str string) stored as orc;
> insert into foo values('あa🤎いiうu');
> SELECT
>   SUBSTR(str, 1, 3) as b1,
>   SUBSTR(str, 3) as b2,
>   SUBSTR(str, -5) as b3
> FROM foo
> ;
> あa🤎 🤎いiうu 🤎いiうu {code}
> {code:java}
> -- non-vectorized
> SELECT
>   SUBSTR('あa🤎いiうu', 1, 3) as b1,
>   SUBSTR('あa🤎いiうu', 3) as b2,
>   SUBSTR('あa🤎いiうu', -5) as b3
> ;
> あa? �いiうu ?いiうu{code}
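
For anyone who wants to see the same kind of symptom outside Hive, here is a minimal Java sketch. It is not taken from Hive's source or from the PR. It only assumes that a substring is cut by UTF-16 char index, which splits the surrogate pair that represents a 4-byte UTF-8 character such as 🤎 (U+1F90E), and that the result is then encoded as UTF-8. The class and method names (`SubstrCodePointSketch`, `substrByChar`, `substrByCodePoint`, `utf8RoundTrip`) are hypothetical, and the helpers only handle a positive 1-based `pos` that lies within the string. The test string is the one from the issue, written with `\u` escapes so the file compiles under any source encoding.

```java
import java.nio.charset.StandardCharsets;

public class SubstrCodePointSketch {

    // Hypothetical helper (not Hive code): 1-based substring counted in UTF-16 chars.
    // A 4-byte UTF-8 character occupies two chars, so this can split it in half.
    static String substrByChar(String s, int pos, int len) {
        int start = pos - 1;
        int end = Math.min(s.length(), start + len);
        return s.substring(start, end);
    }

    // Hypothetical helper (not Hive code): 1-based substring counted in code points.
    // Assumes 1 <= pos <= number of code points + 1.
    static String substrByCodePoint(String s, int pos, int len) {
        int total = s.codePointCount(0, s.length());
        int startCp = pos - 1;
        int endCp = Math.min(total, startCp + len);
        // Translate code point offsets into char offsets before slicing.
        int start = s.offsetByCodePoints(0, startCp);
        int end = s.offsetByCodePoints(0, endCp);
        return s.substring(start, end);
    }

    // Encodes to UTF-8 and decodes again; String.getBytes replaces an
    // unpaired surrogate with the charset's replacement byte, which is '?'.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The string from the issue; U+1F90E is the surrogate pair \uD83E\uDD0E.
        String str = "\u3042a\uD83E\uDD0E\u3044i\u3046u";

        System.out.println(str.length());                        // 8 UTF-16 chars
        System.out.println(str.codePointCount(0, str.length())); // 7 code points

        // Char-based slicing keeps only the high surrogate, which becomes '?'.
        String broken = utf8RoundTrip(substrByChar(str, 1, 3));
        System.out.println(broken.equals("\u3042a?"));           // true

        // Code-point-based slicing keeps the whole character.
        String intact = substrByCodePoint(str, 1, 3);
        System.out.println(intact.equals("\u3042a\uD83E\uDD0E")); // true
    }
}
```

The char-based helper reproduces the `あa?` value shown for `b1` in the non-vectorized example. Whether the non-vectorized code path in Hive fails for the same reason is an assumption here, and the `b2` and `b3` values in the report suggest the real behaviour may involve more than this one effect.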