[ https://issues.apache.org/jira/browse/HIVE-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-27370:
----------------------------------
    Labels: pull-request-available  (was: )

> SUBSTR UDF return '?' against 4-bytes character
> -----------------------------------------------
>
>                 Key: HIVE-27370
>                 URL: https://issues.apache.org/jira/browse/HIVE-27370
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: All Versions
>            Reporter: Ryu Kobayashi
>            Assignee: Ryu Kobayashi
>            Priority: Major
>              Labels: pull-request-available
>
> SUBSTR doesn't seem to support 4-byte characters. This also happens in master
> branch. Also, this does not happen in vectorized mode, so it is a problem
> specific to non-vectorized mode. An example is below:
> {code:java}
> -- vectorized mode
> create temporary table foo (str string) stored as orc;
> insert into foo values('安佐町大字久地字野𨵱4614番地'), ('あa🤎いiうu');
> SELECT
>   SUBSTR(str, 1, 10) as a1,
>   SUBSTR(str, 10, 3) as a2,
>   SUBSTR(str, -7) as a3,
>   substr(str, 1, 3) as b1,
>   substr(str, 3) as b2,
>   substr(str, -5) as b3
> from foo
> ;
> 安佐町大字久地字野𨵱 𨵱4614番地 安佐町 町大字久地字野𨵱4614番地 614番地
> あa🤎 あa🤎いiうu あa🤎 🤎いiうu 🤎いiうu {code}
> {code:java}
> -- non-vectorized
> SELECT
>   SUBSTR('安佐町大字久地字野𨵱4614番地', 1, 10) as a1,
>   SUBSTR('安佐町大字久地字野𨵱4614番地', 10, 3) as a2,
>   SUBSTR('安佐町大字久地字野𨵱4614番地', -7) as a3,
>   substr('あa🤎いiうu', 1, 3) as b1,
>   substr('あa🤎いiうu', 3) as b2,
>   substr('あa🤎いiうu', -5) as b3
> ;
> 安佐町大字久地字野? �4 ?4614番地 あa? �いiうu ?いiうu{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
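
For illustration only: the report contrasts results that keep 4-byte characters such as 𨵱 and 🤎 intact with results that replace them by '?' or '�'. In Java, such a character occupies two UTF-16 code units (a surrogate pair), so a substring cut by code-unit index can split it. The sketch below counts positions in Unicode code points instead. It is not Hive's implementation and not the patch attached to this issue; the class name CodePointSubstr and both substr helpers are hypothetical, and returning an empty string for out-of-range positions is an assumption.

```java
final class CodePointSubstr {
    private CodePointSubstr() {}

    /**
     * Substring with a 1-based start position measured in code points.
     * A negative pos counts back from the end; pos 0 is treated like 1.
     * Out-of-range positions yield "" (an assumption of this sketch).
     */
    static String substr(String s, int pos, int len) {
        int n = s.codePointCount(0, s.length());
        // Translate the 1-based (or negative) position to a 0-based code point index.
        int start = pos > 0 ? pos - 1 : (pos < 0 ? n + pos : 0);
        if (len <= 0 || start < 0 || start >= n) {
            return "";
        }
        // Clamp the end to the string length; use long to avoid int overflow.
        int end = (int) Math.min((long) start + len, n);
        // Convert code point indices to UTF-16 indices before slicing.
        int from = s.offsetByCodePoints(0, start);
        int to = s.offsetByCodePoints(from, end - start);
        return s.substring(from, to);
    }

    /** Variant without a length: everything from pos to the end. */
    static String substr(String s, int pos) {
        return substr(s, pos, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        String s = "あa🤎いiうu";
        System.out.println(substr(s, 1, 3));
        System.out.println(substr(s, 3));
        System.out.println(substr(s, -5));
    }
}
```

With this helper, substr("あa🤎いiうu", 1, 3) keeps 🤎 whole, whereas "あa🤎いiうu".substring(0, 3) ends on the first half of its surrogate pair, which typically prints as '?' once encoded.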