[ https://issues.apache.org/jira/browse/HIVE-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryu Kobayashi updated HIVE-27370:
---------------------------------
    Description:
SUBSTR doesn't seem to support 4-byte characters. This also happens in master branch. Also, this does not happen in vectorized mode, so it is a problem specific to non-vectorized mode. An example is below:
{code:java}
-- vectorized mode
create temporary table foo (str string) stored as orc;
insert into foo values('あa🤎いiうu');

SELECT
  SUBSTR(str, 1, 3) as b1,
  SUBSTR(str, 3) as b2,
  SUBSTR(str, -5) as b3
FROM foo
;

あa🤎 🤎いiうu 🤎いiうu
{code}
{code:java}
-- non-vectorized
SELECT
  SUBSTR('あa🤎いiうu', 1, 3) as b1,
  SUBSTR('あa🤎いiうu', 3) as b2,
  SUBSTR('あa🤎いiうu', -5) as b3
;

あa? �いiうu ?いiうu
{code}

  was:
SUBSTR doesn't seem to support 4-byte characters. This also happens in master branch. Also, this does not happen in vectorized mode, so it is a problem specific to non-vectorized mode. An example is below:
{code:java}
-- vectorized mode
create temporary table foo (str string) stored as orc;
insert into foo values('安佐町大字久地字野𨵱4614番地'), ('あa🤎いiうu');

SELECT
  SUBSTR(str, 1, 10) as a1,
  SUBSTR(str, 10, 3) as a2,
  SUBSTR(str, -7) as a3,
  substr(str, 1, 3) as b1,
  substr(str, 3) as b2,
  substr(str, -5) as b3
from foo
;

安佐町大字久地字野𨵱 𨵱4614番地 安佐町 町大字久地字野𨵱4614番地 614番地 あa🤎 あa🤎いiうu あa🤎 🤎いiうu 🤎いiうu
{code}
{code:java}
-- non-vectorized
SELECT
  SUBSTR('安佐町大字久地字野𨵱4614番地', 1, 10) as a1,
  SUBSTR('安佐町大字久地字野𨵱4614番地', 10, 3) as a2,
  SUBSTR('安佐町大字久地字野𨵱4614番地', -7) as a3,
  substr('あa🤎いiうu', 1, 3) as b1,
  substr('あa🤎いiうu', 3) as b2,
  substr('あa🤎いiうu', -5) as b3
;

安佐町大字久地字野? �4 ?4614番地 あa? �いiうu ?いiうu
{code}

> SUBSTR UDF return '?' against 4-bytes character
> -----------------------------------------------
>
>                 Key: HIVE-27370
>                 URL: https://issues.apache.org/jira/browse/HIVE-27370
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Ryu Kobayashi
>            Assignee: Ryu Kobayashi
>            Priority: Major
>              Labels: pull-request-available
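The '?' in the non-vectorized output is consistent with a substring boundary falling inside a UTF-16 surrogate pair. That is an assumption about the mechanism; the report does not confirm it. The sketch below is a standalone Java illustration, not Hive's UDF code: the class and helper names are hypothetical, and only positive start positions are handled. It cuts the same test string once by UTF-16 code units and once by Unicode code points.

```java
import java.nio.charset.StandardCharsets;

public class SubstrCodePointSketch {
    // Test string from the report, "あa🤎いiうu", written with escapes so the
    // source compiles regardless of file encoding. U+1F90E (the brown heart)
    // is the surrogate pair \uD83E\uDD0E, which is 4 bytes in UTF-8.
    static final String SAMPLE = "\u3042a\uD83E\uDD0E\u3044i\u3046u";

    // Hypothetical helper: 1-based start, length counted in UTF-16 code units.
    // A cut can land between the two halves of a surrogate pair.
    static String substrByCodeUnits(String s, int start, int len) {
        int begin = Math.min(s.length(), Math.max(0, start - 1));
        int end = Math.min(s.length(), begin + len);
        return s.substring(begin, end);
    }

    // Hypothetical helper: 1-based start, length counted in code points,
    // so a supplementary character is never split.
    static String substrByCodePoints(String s, int start, int len) {
        int total = s.codePointCount(0, s.length());
        int beginCp = Math.min(total, Math.max(0, start - 1));
        int endCp = Math.min(total, beginCp + len);
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(0, endCp);
        return s.substring(begin, end);
    }

    // Encoding to UTF-8 replaces an unpaired surrogate with the byte for '?'.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Code-unit cut: the third unit is only the high surrogate of the emoji.
        System.out.println(roundTripUtf8(substrByCodeUnits(SAMPLE, 1, 3)));
        // Code-point cut: the emoji is kept whole.
        System.out.println(roundTripUtf8(substrByCodePoints(SAMPLE, 1, 3)));
    }
}
```

With the code-unit helper, the first three units end on an unpaired high surrogate, which becomes '?' once encoded to UTF-8. That matches the shape of the b1 column in the non-vectorized output. Counting in code points keeps the emoji intact, as in the vectorized output.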