[jira] [Commented] (HIVE-22670) ArrayIndexOutOfBoundsException when vectorized reader is used for reading a parquet file

Ganesha Shreedhara (Jira) Thu, 05 Mar 2020 02:06:30 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051973#comment-17051973
 ]


Ganesha Shreedhara commented on HIVE-22670:
-------------------------------------------

[~pvary] The data file I have contains confidential information, so I can't 
attach it. I'll check whether I can generate test data with the same 
distribution of encoded data.

> ArrayIndexOutOfBoundsException when vectorized reader is used for reading a 
> parquet file
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22670
>                 URL: https://issues.apache.org/jira/browse/HIVE-22670
>             Project: Hive
>          Issue Type: Bug
>          Components: Parquet, Vectorization
>    Affects Versions: 3.1.2, 2.3.6
>            Reporter: Ganesha Shreedhara
>            Assignee: Ganesha Shreedhara
>            Priority: Major
>         Attachments: HIVE-22670.1.patch
>
>
> An ArrayIndexOutOfBoundsException is thrown while decoding the dictionaryIds 
> of a row group in a Parquet file with vectorization enabled. 
> *Exception stack trace:*
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>  at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:122)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readString(ParquetDataColumnReaderFactory.java:95)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.decodeDictionaryIds(VectorizedPrimitiveColumnReader.java:467)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.readBatch(VectorizedPrimitiveColumnReader.java:68)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:410)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:353)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:92)
>  at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
>  ... 24 more{code}
>  
> This issue seems to be caused by reusing the same dictionary column vector 
> while reading consecutive row groups. It looks like a corner-case bug that 
> occurs for a certain distribution of dictionary/plain encoded data while we 
> read/populate the underlying bit-packed dictionary data into a 
> column-vector-based data structure. 
> A similar issue was reported in Spark (Ref: 
> https://issues.apache.org/jira/browse/SPARK-16334)
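
The suspected failure mode in the description can be sketched in plain Java. This is a hypothetical, simplified illustration, not Hive or Parquet code: the class StaleDictionaryIdsSketch, its decode helper, and the assumption that the second row group has an empty dictionary are all invented for the example. It only shows how dictionary ids left over from one row group can index past the end of the next row group's dictionary.

```java
// Hypothetical sketch; none of these names exist in Hive or Parquet.
class StaleDictionaryIdsSketch {

    // Looks up each id in the dictionary, as a dictionary decoder would per row.
    static String[] decode(int[] ids, String[] dictionary) {
        String[] values = new String[ids.length];
        for (int i = 0; i < ids.length; i++) {
            // Fails if the id is not a valid index into this dictionary.
            values[i] = dictionary[ids[i]];
        }
        return values;
    }

    // Returns true if decoding reused ids against a new dictionary fails.
    static boolean staleIdsFail() {
        int[] sharedIds = {0, 1, 1, 0};          // id buffer reused across row groups
        String[] firstDictionary = {"a", "b"};   // row group 1: dictionary encoded
        decode(sharedIds, firstDictionary);      // works: ids match this dictionary

        // Assumption: row group 2 is plain encoded, so its dictionary is empty,
        // but the reader still decodes the ids left over from row group 1.
        String[] secondDictionary = {};
        try {
            decode(sharedIds, secondDictionary);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale ids fail: " + staleIdsFail());
    }
}
```

Under these assumptions, a reader that resets or reallocates its id buffer for each row group would avoid the stale lookup. Whether the attached patch takes that approach is not stated in this message.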



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
