[ https://issues.apache.org/jira/browse/HIVE-21042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720928#comment-16720928 ]
yucai commented on HIVE-21042: ------------------------------ In LazyHiveVarchar, it is decoded by Text.decode, while in LazyString, it is not. Does it work by design? LazyHiveVarchar: {code} byteData = Text.decode(bytes.getData(), start, length); data.set(byteData, maxLength); {code} LazyString: {code} data.set(bytes.getData(), start, length); {code} > varchar and string will behave differently for non-standard utf8 characters > --------------------------------------------------------------------------- > > Key: HIVE-21042 > URL: https://issues.apache.org/jira/browse/HIVE-21042 > Project: Hive > Issue Type: Bug > Components: SQL > Reporter: yucai > Priority: Major > Attachments: part-00007-7fa8e32c-59f3-485a-864c-3881b1a01413.txt > > > If data contains non-standard utf8 characters, varchar and string will behave > differently. > The content in /tmp/hex_data (as attached) is > 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 in hex. > *B0A5 and* *8DAEAB* are non-standard utf8 characters, they are encoded to > EFBFBD if the data type is varchar, but string will not change it. So: > VARCHAR in hex shows: > 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 > STRING in hex shows: > 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 > See details: > {code:java} > hive> DROP TABLE TBL_S; > OK > Time taken: 0.562 seconds > hive> CREATE TABLE TBL_S > > ( > > GUID STRING > > ) > > row format delimited fields terminated by '\177' stored as textfile > > LOCATION > > '/tmp/hex_data' > > tblproperties('serialization.null.format'='', 'timestamp.formats' = > 'yyyy-MM-dd HH:mm:ss') > > ; > OK > Time taken: 3.074 seconds > hive> > > DROP TABLE TBL_V; > OK > Time taken: 0.894 seconds > hive> CREATE TABLE TBL_V > > ( > > GUID VARCHAR(32) > > ) > > row format delimited fields terminated by '\177' stored as textfile > > LOCATION > > '/tmp/hex_data' > > tblproperties('serialization.null.format'='', 'timestamp.formats' = > 'yyyy-MM-dd HH:mm:ss') > > ; > OK > Time taken: 0.242 seconds > hive> SELECT GUID, hex(GUID) FROM TBL_S; > OK > a0750c1f13f0k5��F8j���b�Ro'4da96 > 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 > Time taken: 1.581 seconds, Fetched: 1 row(s) > hive> SELECT GUID, hex(GUID) FROM TBL_V; > OK > a0750c1f13f0k5��F8j���b�Ro'4da96 > 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 > {code} > Is it expected? -- This message was sent by Atlassian JIRA (v7.6.3#76005)