[ https://issues.apache.org/jira/browse/HIVE-21042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yucai updated HIVE-21042: ------------------------- Description: If data contains malformed utf8 characters, varchar and string will behave differently. The content in /tmp/hex_data (as attached) is 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 in hex. *B0A5 and* *8DAEAB* are malformed utf8 characters, they are encoded to EFBFBD if the data type is varchar, but string will not change it. So: VARCHAR in hex shows: 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 STRING in hex shows: 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 See details: {code:java} hive> DROP TABLE TBL_S; OK Time taken: 0.562 seconds hive> CREATE TABLE TBL_S > ( > GUID STRING > ) > row format delimited fields terminated by '\177' stored as textfile > LOCATION > '/tmp/hex_data' > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss') > ; OK Time taken: 3.074 seconds hive> > DROP TABLE TBL_V; OK Time taken: 0.894 seconds hive> CREATE TABLE TBL_V > ( > GUID VARCHAR(32) > ) > row format delimited fields terminated by '\177' stored as textfile > LOCATION > '/tmp/hex_data' > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss') > ; OK Time taken: 0.242 seconds hive> SELECT GUID, hex(GUID) FROM TBL_S; OK a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 Time taken: 1.581 seconds, Fetched: 1 row(s) hive> SELECT GUID, hex(GUID) FROM TBL_V; OK a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 {code} Is it expected? was: If data contains non-standard utf8 characters, varchar and string will behave differently. The content in /tmp/hex_data (as attached) is 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 in hex. *B0A5 and* *8DAEAB* are non-standard utf8 characters, they are encoded to EFBFBD if the data type is varchar, but string will not change it. So: VARCHAR in hex shows: 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 STRING in hex shows: 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 See details: {code:java} hive> DROP TABLE TBL_S; OK Time taken: 0.562 seconds hive> CREATE TABLE TBL_S > ( > GUID STRING > ) > row format delimited fields terminated by '\177' stored as textfile > LOCATION > '/tmp/hex_data' > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss') > ; OK Time taken: 3.074 seconds hive> > DROP TABLE TBL_V; OK Time taken: 0.894 seconds hive> CREATE TABLE TBL_V > ( > GUID VARCHAR(32) > ) > row format delimited fields terminated by '\177' stored as textfile > LOCATION > '/tmp/hex_data' > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss') > ; OK Time taken: 0.242 seconds hive> SELECT GUID, hex(GUID) FROM TBL_S; OK a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 Time taken: 1.581 seconds, Fetched: 1 row(s) hive> SELECT GUID, hex(GUID) FROM TBL_V; OK a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 {code} Is it expected? Summary: varchar and string will behave differently for malformed utf8 characters (was: varchar and string will behave differently for non-standard utf8 characters) > varchar and string will behave differently for malformed utf8 characters > ------------------------------------------------------------------------ > > Key: HIVE-21042 > URL: https://issues.apache.org/jira/browse/HIVE-21042 > Project: Hive > Issue Type: Bug > Components: SQL > Reporter: yucai > Priority: Major > Attachments: part-00007-7fa8e32c-59f3-485a-864c-3881b1a01413.txt > > > If data contains malformed utf8 characters, varchar and string will behave > differently. > The content in /tmp/hex_data (as attached) is > 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 in hex. > *B0A5 and* *8DAEAB* are malformed utf8 characters, they are encoded to EFBFBD > if the data type is varchar, but string will not change it. So: > VARCHAR in hex shows: > 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 > STRING in hex shows: > 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 > See details: > {code:java} > hive> DROP TABLE TBL_S; > OK > Time taken: 0.562 seconds > hive> CREATE TABLE TBL_S > > ( > > GUID STRING > > ) > > row format delimited fields terminated by '\177' stored as textfile > > LOCATION > > '/tmp/hex_data' > > tblproperties('serialization.null.format'='', 'timestamp.formats' = > 'yyyy-MM-dd HH:mm:ss') > > ; > OK > Time taken: 3.074 seconds > hive> > > DROP TABLE TBL_V; > OK > Time taken: 0.894 seconds > hive> CREATE TABLE TBL_V > > ( > > GUID VARCHAR(32) > > ) > > row format delimited fields terminated by '\177' stored as textfile > > LOCATION > > '/tmp/hex_data' > > tblproperties('serialization.null.format'='', 'timestamp.formats' = > 'yyyy-MM-dd HH:mm:ss') > > ; > OK > Time taken: 0.242 seconds > hive> SELECT GUID, hex(GUID) FROM TBL_S; > OK > a0750c1f13f0k5��F8j���b�Ro'4da96 > 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 > Time taken: 1.581 seconds, Fetched: 1 row(s) > hive> SELECT GUID, hex(GUID) FROM TBL_V; > OK > a0750c1f13f0k5��F8j���b�Ro'4da96 > 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936 > {code} > Is it expected? -- This message was sent by Atlassian JIRA (v7.6.3#76005)