[ https://issues.apache.org/jira/browse/HIVE-22734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qing Miao updated HIVE-22734:
-----------------------------
    Description:
Hi, I am new here, but I have used Hive for some years.

I created an ORC table with a single column of type varchar(6) and inserted a multi-byte string into it, as shown below:

hive> insert into mq1 values ('一二三四五六七') ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = mq5445_20200116144748_cb87f769-9d3f-4b3b-b384-92c22b8ef06a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2020-01-16 14:47:52,024 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local484725283_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://wsl:9000/user/hive/warehouse/mq1/.hive-staging_hive_2020-01-16_14-47-48_936_2091348056955954494-1/-ext-10000
Loading data to table default.mq1
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 524 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 5.467 seconds
hive> select * from mq1 ;
OK
一二
一二
Time taken: 0.301 seconds, Fetched: 2 row(s)
hive> show create table mq1 ;
OK
CREATE TABLE `mq1`(
  `col1` varchar(6))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://wsl:9000/user/hive/warehouse/mq1'
TBLPROPERTIES (
  'transient_lastDdlTime'='1579157273')
Time taken: 0.281 seconds, Fetched: 12 row(s)

It seems that ORC cannot store six multi-byte characters in a varchar(6) column the way MySQL does. For Chinese text in UTF-8, only two characters were stored, at 3 bytes each (6 bytes in total), so the limit of 6 appears to be applied to bytes rather than characters.
Other Hive storage formats, for example text and Parquet, work correctly in this situation.
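The two-character result reported above is consistent with the limit of 6 being counted in UTF-8 bytes instead of characters. A minimal Python sketch of that arithmetic follows. It does not reproduce any Hive or ORC code, and the helper names truncate_by_bytes and truncate_by_chars are hypothetical, chosen only for illustration.

```python
def truncate_by_bytes(text: str, limit: int) -> str:
    """Keep at most `limit` UTF-8 bytes; a partial trailing character is dropped."""
    raw = text.encode("utf-8")[:limit]
    # errors="ignore" discards an incomplete multi-byte sequence at the end.
    return raw.decode("utf-8", errors="ignore")


def truncate_by_chars(text: str, limit: int) -> str:
    """Keep at most `limit` Unicode code points."""
    return text[:limit]


value = "一二三四五六七"

# Each of these characters takes 3 bytes in UTF-8: 7 characters, 21 bytes.
print(len(value), len(value.encode("utf-8")))

# Byte-based limit of 6 keeps 6 / 3 = 2 characters.
print(truncate_by_bytes(value, 6))

# Character-based limit of 6 keeps 6 characters.
print(truncate_by_chars(value, 6))
```

The byte-based helper yields the same two characters seen in the mq1 output, while the character-based helper yields the six characters that varchar(6) would be expected to hold.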
My Hive version is 2.3.6/2.2.0 on Hadoop 2.7.0, and ORC does not work correctly.
It seems that the ORC project fixed something related in version 1.6.2, so I just swapped orc-core-1.6.2.jar into the Hive lib directory.
It still does not work correctly.

hive> insert into mq2 values ('一二三四五六七') ;
hive> insert into mq2 values ('一二三四五六七') ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = mq5445_20200116152037_0799cb92-b6d4-4e25-9544-b0213768217a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
('一二三四五六七') ;
Job running in-process (local Hadoop)
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2020-01-16 15:20:40,127 Stage-1 map = 0%, reduce = 0%
2020-01-16 15:20:41,137 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local2085128098_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://wsl:9000/user/hive/warehouse/mq2/.hive-staging_hive_2020-01-16_15-20-37_380_7016274963079907260-1/-ext-10000
Loading data to table default.mq2
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 1165 HDFS Write: 701 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 4.627 seconds
hive> select * from mq2 ;
NoViableAltException(352@[])
 at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36710)
 at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:36987)
 at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36920)
 at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:36987)
 at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36633)
 at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35822)
 at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35710)
 at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2284)
 at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1333)
 at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
 at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
 at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:244)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
FAILED: ParseException line 1:1 cannot recognize input near ''一二三四五六七'' ')' '<EOF>' in statement
hive> select * from mq2 ;
OK
一二三四五六
Time taken: 0.536 seconds, Fetched: 1 row(s)

  was:
Hi, I am new here, but I have used Hive for some years.

I created an ORC table with a single column of type varchar(6) and inserted a multi-byte string into it, as shown below:

hive> insert into mq1 values ('一二三四五六七') ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = mq5445_20200116144748_cb87f769-9d3f-4b3b-b384-92c22b8ef06a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2020-01-16 14:47:52,024 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local484725283_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://wsl:9000/user/hive/warehouse/mq1/.hive-staging_hive_2020-01-16_14-47-48_936_2091348056955954494-1/-ext-10000
Loading data to table default.mq1
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 524 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 5.467 seconds
hive> select * from mq1 ;
OK
一二
一二
Time taken: 0.301 seconds, Fetched: 2 row(s)
hive> show create table mq1 ;
OK
CREATE TABLE `mq1`(
  `col1` varchar(6))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://wsl:9000/user/hive/warehouse/mq1'
TBLPROPERTIES (
  'transient_lastDdlTime'='1579157273')
Time taken: 0.281 seconds, Fetched: 12 row(s)

It seems that ORC cannot store six multi-byte characters in a varchar(6) column the way MySQL does. For Chinese text in UTF-8, only two characters were stored, at 3 bytes each.
Other Hive storage formats, for example text and Parquet, work correctly in this situation.
My Hive version is 2.3.6/2.2.0 on Hadoop 2.7.0, and ORC does not work correctly.
It seems that the ORC project fixed something related in version 1.6.2, so I just swapped orc-core-1.6.2.jar into the Hive lib directory.
It still does not work correctly.

> orc multi-byte character varchar type stored in some truncation
> ---------------------------------------------------------------
>
> Key: HIVE-22734
> URL: https://issues.apache.org/jira/browse/HIVE-22734
> Project: Hive
> Issue Type: Improvement
> Components: Database/Schema
> Affects Versions: 2.3.6
> Environment: Ubuntu and CentOS 7
> Reporter: Qing Miao
> Priority: Major
> Labels: hive, orc, utf-8

--
This message was sent by Atlassian Jira
(v8.3.4#803005)