[ https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ganesha Shreedhara updated HIVE-25494:
--------------------------------------
    Labels: schema-evolution  (was: )

> Hive query fails with IndexOutOfBoundsException when a struct type column's
> field is missing in parquet file schema but present in table schema
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25494
>                 URL: https://issues.apache.org/jira/browse/HIVE-25494
>             Project: Hive
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 3.1.2
>            Reporter: Ganesha Shreedhara
>            Priority: Major
>              Labels: schema-evolution
>         Attachments: test-struct.parquet
>
>
> When a struct type column's field is missing in the parquet file schema but present in the table schema, and columns are accessed by name, the requestedSchema that Hive sends to the Parquet storage layer includes a type even for the missing field, because Hive always adds a primitive type when a field is absent from the file schema (Ref: [code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
> On the Parquet side, the missing field gets pruned, and since it belongs to a struct type, Parquet ends up creating a GroupColumnIO without any children. This causes the query to fail with an IndexOutOfBoundsException; the stack trace is given below.
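> The bottom frames of the trace can be modeled in isolation. The classes below are simplified, hypothetical stand-ins (not the real org.apache.parquet.io types): GroupColumnIO.getFirst() effectively calls children.get(0), so a group that lost all of its children to schema pruning throws as soon as the record reader walks the column I/O tree.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Parquet's column I/O tree (NOT the real
// org.apache.parquet.io classes): a group node holds child nodes, and
// getFirst() returns the first child, mirroring GroupColumnIO.getFirst().
class GroupNode {
    final List<GroupNode> children = new ArrayList<>();

    GroupNode getFirst() {
        // Equivalent of children.get(0) inside GroupColumnIO.getFirst();
        // on an empty group this throws IndexOutOfBoundsException.
        return children.get(0);
    }
}

public class EmptyGroupDemo {
    public static void main(String[] args) {
        // A struct column whose only requested field was pruned away:
        // the group survives in the requested schema but has no children.
        GroupNode prunedStruct = new GroupNode();
        try {
            prunedStruct.getFirst();
        } catch (IndexOutOfBoundsException e) {
            System.out.println("caught " + e.getClass().getSimpleName());
        }
    }
}
{code}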
>
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
> 	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
> 	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
> 	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
> 	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
> 	... 15 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> 	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
> 	at java.util.ArrayList.get(ArrayList.java:433)
> 	at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
> 	at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
> 	at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
> 	at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
> 	at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
> 	at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
> 	at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
> 	at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> 	at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
> 	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
> 	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
> {code}
>
> Steps to reproduce:
>
> {code:java}
> CREATE TABLE parquet_struct_test(
>   `parent` struct<child:string,extracol:string> COMMENT '',
>   `toplevel` string COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>
> -- Use the attached test-struct.parquet data file to load data to this table
> LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
>
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file
> hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet
> {code}
>
> The same query works fine in the following scenarios:
>
> 1) Accessing parquet file columns by index instead of names
> {code:java}
> hive> set parquet.column.index.access=true;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> NULL	toplevel
> {code}
>
> 2) When VectorizedParquetRecordReader is used
> {code:java}
> hive> set hive.fetch.task.conversion=none;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 ..........      container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
> ----------------------------------------------------------------------------------------------
> OK
> NULL	toplevel
> {code}
>
> 3) Creating a copy of the same table and running the same query on the newly created table.
> {code:java}
> hive> create table parquet_struct_test_copy like parquet_struct_test;
> OK
> hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
> Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 ..........      container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
> ----------------------------------------------------------------------------------------------
> Loading data to table default.parquet_struct_test_copy
> OK
> hive> select parent.extracol, toplevel from parquet_struct_test_copy;
> OK
> NULL	toplevel
> {code}
>
> Also, this issue doesn't occur when only the missing struct type column's field is selected, or when all of the table's fields are selected. It occurs only when a missing struct type column's field is selected in combination with another existing column.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)