[ https://issues.apache.org/jira/browse/HIVE-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340731#comment-16340731 ]
Colin Ma commented on HIVE-18553:
---------------------------------

Hi [~vihangk1], I did some investigation on this problem:

Step 1: without vectorization, check how ORC and Parquet behave when a new column is added. The statements used:
{code:java}
create table test_p_parquet(t1 tinyint, t2 tinyint, i1 int, i2 int) stored as parquet;
create table test_p_orc(t1 tinyint, t2 tinyint, i1 int, i2 int) stored as orc;
insert into test_p_parquet values (1,2,3,4),(5,6,7,8);
insert into test_p_orc values (1,2,3,4),(5,6,7,8);
alter table test_p_parquet add columns (ts timestamp);
alter table test_p_orc add columns (ts timestamp);
select * from test_p_parquet;
1 2 3 4 NULL
5 6 7 8 NULL
select * from test_p_orc;
1 2 3 4 NULL
5 6 7 8 NULL
{code}
So far the result is as expected, but inserting new data exposes a problem:
{code:java}
insert into test_p_parquet values (11,12,13,14,15);
insert into test_p_orc values (11,12,13,14,15);
select * from test_p_parquet;
1 2 3 4 NULL
5 6 7 8 NULL
11 12 13 14 NULL
select * from test_p_orc;
1 2 3 4 NULL
5 6 7 8 NULL
11 12 13 14 NULL
{code}
The new column is still NULL, so the newly inserted value is lost. From this result, it looks like Hive cannot write data to a newly added column for either Parquet or ORC.

Step 2: with vectorization enabled, Parquet fails in the way you describe, and the ORC result is also incorrect: the query returns an empty result set. Looking at the implementation of VectorizedParquetRecordReader, the exception is thrown because the new column does not exist in the Parquet file; adding a column does not rewrite the existing data files. So I think the root question is whether Hive supports adding columns to Parquet/ORC tables dynamically at all, rather than a problem specific to VectorizedParquetRecordReader.
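To make the needed behavior concrete, here is a minimal, self-contained Java sketch of the schema-evolution handling discussed above: columns that the table schema declares but an older data file lacks come back as all-NULL vectors instead of causing a failure, which matches the NULLs the non-vectorized path returns in Step 1. This is illustrative only, not the actual Hive reader code or its eventual fix; ColumnVector, readBatch, and SchemaEvolutionSketch are hypothetical stand-ins.
{code:java}
import java.util.Arrays;
import java.util.List;

public class SchemaEvolutionSketch {

  /** Toy stand-in for one column of a vectorized row batch. */
  static class ColumnVector {
    final Object[] values;
    final boolean[] isNull;
    ColumnVector(int size) {
      values = new Object[size];
      isNull = new boolean[size];
    }
    void fillWithNulls() {
      Arrays.fill(isNull, true); // mark the whole column as NULL
    }
  }

  /**
   * Populate one batch: columns present in the file are "read"
   * (simulated here); columns missing from the file are nulled out
   * up front instead of being requested from the file.
   */
  static ColumnVector[] readBatch(List<String> tableColumns,
                                  List<String> fileColumns,
                                  int batchSize) {
    ColumnVector[] batch = new ColumnVector[tableColumns.size()];
    for (int i = 0; i < tableColumns.size(); i++) {
      batch[i] = new ColumnVector(batchSize);
      String col = tableColumns.get(i);
      if (!fileColumns.contains(col)) {
        // Column was added after this file was written: emit NULLs.
        batch[i].fillWithNulls();
      } else {
        // A real reader would decode the column chunk here.
        for (int r = 0; r < batchSize; r++) {
          batch[i].values[r] = col + "#" + r;
        }
      }
    }
    return batch;
  }

  public static void main(String[] args) {
    // Table has gained a "ts" column; the old file only has t1..i2.
    List<String> table = Arrays.asList("t1", "t2", "i1", "i2", "ts");
    List<String> file  = Arrays.asList("t1", "t2", "i1", "i2");
    ColumnVector[] batch = readBatch(table, file, 2);
    System.out.println("ts is all NULL: "
        + (batch[4].isNull[0] && batch[4].isNull[1]));
  }
}
{code}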
> VectorizedParquetReader fails after adding a new column to table
> ----------------------------------------------------------------
>
>                 Key: HIVE-18553
>                 URL: https://issues.apache.org/jira/browse/HIVE-18553
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 3.0.0, 2.4.0, 2.3.2
>            Reporter: Vihang Karajgaonkar
>            Priority: Major
>
> VectorizedParquetReader throws an exception when trying to read from a parquet table on which new columns are added. Steps to reproduce below:
> {code}
> 0: jdbc:hive2://localhost:10000/default> desc test_p;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | t1        | tinyint    |          |
> | t2        | tinyint    |          |
> | i1        | int        |          |
> | i2        | int        |          |
> +-----------+------------+----------+
> 0: jdbc:hive2://localhost:10000/default> set hive.fetch.task.conversion=none;
> 0: jdbc:hive2://localhost:10000/default> set hive.vectorized.execution.enabled=true;
> 0: jdbc:hive2://localhost:10000/default> alter table test_p add columns (ts timestamp);
> 0: jdbc:hive2://localhost:10000/default> select * from test_p;
> Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
> {code}
> Following exception is seen in the logs:
> {code}
> Caused by: java.lang.IllegalArgumentException: [ts] BINARY is not in the store: [[i1] INT32, [i2] INT32, [t1] INT32, [t2] INT32] 3
>   at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:160) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.buildVectorizedParquetReader(VectorizedParquetRecordReader.java:479) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:432) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:393) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:345) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:88) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:167) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:52) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:229) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:142) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271) ~[hadoop-mapreduce-client-common-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_121]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_121]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
>   at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
> {code}
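For readers following the trace: the failure originates where the reader asks the row-group page store for a column ([ts] BINARY) that was never written into that file. The toy sketch below mimics that behavior and shows the kind of pre-check a schema-evolution-aware caller would need before requesting a page reader. All names here (PageStore, hasColumn, PageStoreGuardSketch) are hypothetical stand-ins, not the real org.apache.parquet API.
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class PageStoreGuardSketch {

  /** Toy stand-in for ColumnChunkPageReadStore: column name -> pages. */
  static class PageStore {
    private final Map<String, Object> readers = new LinkedHashMap<>();

    void put(String column, Object pageReader) {
      readers.put(column, pageReader);
    }

    /** Mirrors the failing behavior seen in the stack trace above. */
    Object getPageReader(String column) {
      Object r = readers.get(column);
      if (r == null) {
        throw new IllegalArgumentException(
            "[" + column + "] is not in the store: " + readers.keySet());
      }
      return r;
    }

    /** The guard a schema-evolution-aware caller could apply first. */
    boolean hasColumn(String column) {
      return readers.containsKey(column);
    }
  }

  public static void main(String[] args) {
    PageStore store = new PageStore();
    for (String c : new String[] {"i1", "i2", "t1", "t2"}) {
      store.put(c, new Object());
    }
    // Check before building a reader for the newly added "ts" column,
    // instead of letting getPageReader throw mid-query.
    String requested = "ts";
    if (store.hasColumn(requested)) {
      store.getPageReader(requested);
    } else {
      System.out.println(requested + " absent from file: emit NULLs instead");
    }
  }
}
{code}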