Qing Miao created HIVE-22734:
--------------------------------

             Summary: ORC: multi-byte characters in a varchar column are truncated when stored
                 Key: HIVE-22734
                 URL: https://issues.apache.org/jira/browse/HIVE-22734
             Project: Hive
          Issue Type: Improvement
          Components: Database/Schema
    Affects Versions: 2.3.6
         Environment: Ubuntu and CentOS 7
            Reporter: Qing Miao


Hi, I am new to reporting issues here, although I have used Hive for some years.

I created an ORC table with a single varchar(6) column and inserted a multi-byte string into it, as shown below:

hive> insert into mq1 values ('一二三四五六七');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = mq5445_20200116144748_cb87f769-9d3f-4b3b-b384-92c22b8ef06a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2020-01-16 14:47:52,024 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local484725283_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://wsl:9000/user/hive/warehouse/mq1/.hive-staging_hive_2020-01-16_14-47-48_936_2091348056955954494-1/-ext-10000
Loading data to table default.mq1
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 524 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 5.467 seconds
hive> select * from mq1;
OK
一二
一二
Time taken: 0.301 seconds, Fetched: 2 row(s)
hive> show create table mq1;
OK
CREATE TABLE `mq1`(
  `col1` varchar(6))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://wsl:9000/user/hive/warehouse/mq1'
TBLPROPERTIES (
  'transient_lastDdlTime'='1579157273')
Time taken: 0.281 seconds, Fetched: 12 row(s)

With ORC, a varchar(6) column does not seem to hold six multi-byte characters the way it does in MySQL. For Chinese text in UTF-8, where each character takes 3 bytes, only 2 characters are stored, which suggests the length limit is being applied to bytes (6 bytes) rather than to characters. Other Hive storage formats, for example text and Parquet, handle this case correctly.

I see the problem with Hive 2.3.6 and 2.2.0 on Hadoop 2.7.0; only ORC is affected.
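To illustrate the difference between the expected and the observed result, here is a minimal Python sketch. It is not Hive or ORC code, and the function names `truncate_by_chars` and `truncate_by_bytes` are hypothetical; it only assumes that the truncation seen above is equivalent to cutting the UTF-8 encoding at 6 bytes.

```python
def truncate_by_chars(value: str, max_len: int) -> str:
    """Keep at most max_len characters (code points)."""
    return value[:max_len]


def truncate_by_bytes(value: str, max_len: int) -> str:
    """Keep at most max_len bytes of the UTF-8 encoding.

    A multi-byte character that is cut in half is dropped entirely.
    This mimics the behavior suspected in the report; it is an
    assumption, not the actual ORC writer logic.
    """
    raw = value.encode("utf-8")[:max_len]
    return raw.decode("utf-8", errors="ignore")


text = "一二三四五六七"  # 7 characters, 3 bytes each in UTF-8

print(truncate_by_chars(text, 6))  # 一二三四五六 (expected for varchar(6))
print(truncate_by_bytes(text, 6))  # 一二 (matches what the ORC table returns)
```

The second result matches the rows returned by the select above, which is consistent with a byte-based limit, although it does not prove where in the write path the truncation happens.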
The ORC project appears to have fixed something related in version 1.6.2, so I replaced the ORC jar in the Hive lib directory with orc-core-1.6.2.jar. The problem still occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)