[ https://issues.apache.org/jira/browse/HIVE-26612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620440#comment-17620440 ]
Stamatis Zampetakis commented on HIVE-26612:
--------------------------------------------

It is not my intention to prove that the customer is right or wrong, but rather to clarify whether there is a bug and where it is. When multiple projects are involved in a problem (in this case Spark vs Hive), it is important to understand which side is causing it. If there is a change in the way Spark writes the Parquet file, then this could also be causing the exceptions mentioned here.

The Hive Parquet documentation (https://cwiki.apache.org/confluence/display/Hive/Parquet) is very sketchy, leaving a lot of open questions about what exactly is supported and how things are supposed to work. This ticket, as well as HIVE-23345, presents the fact that Hive cannot read a Parquet TIMESTAMP into a Hive BIGINT as a Hive bug, but there were no tests and no documentation implying that this is possible. In these cases, there is a fine line between bug and feature request.

Another reason why I wanted to know the commit which caused the breaking change in Hive is to understand whether it was intentional or not. Running git bisect with the test case in the PR shows that the Hive commit which broke this use-case is HIVE-21215. Note that if the logical type was missing from the metadata, then things would work as before without problems.

Now I have a better picture of what is happening and it seems reasonable to fix this; I will try to have a look at the PR in the next few days.

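To make the failure in the stack trace quoted below easier to follow, here is a minimal Java sketch. It is a simplified model and not Hive or Parquet code: PrimitiveConverterModel, LongConverter, NoLongConverter and choose are hypothetical names, and the rule "annotation present means a different converter is picked" is an assumption used only for illustration. What the trace does show is that the base addLong rejects the value with an UnsupportedOperationException carrying the converter's class name, which is what happens when the selected converter does not override addLong.

```java
import java.util.ArrayList;
import java.util.List;

class ConverterSketch {

    // Simplified stand-in for a Parquet primitive converter base class:
    // addLong rejects the value unless a subclass overrides it.
    static abstract class PrimitiveConverterModel {
        public void addLong(long value) {
            throw new UnsupportedOperationException(getClass().getName());
        }
    }

    // Hypothetical converter that accepts int64 values (the "works as before" path).
    static class LongConverter extends PrimitiveConverterModel {
        final List<Long> values = new ArrayList<>();

        @Override
        public void addLong(long value) {
            values.add(value);
        }
    }

    // Hypothetical converter that does not override addLong, so the
    // inherited implementation throws as soon as an int64 value arrives.
    static class NoLongConverter extends PrimitiveConverterModel {
    }

    // Assumed selection rule, for illustration only: the presence of a
    // logical type annotation changes which converter is chosen.
    static PrimitiveConverterModel choose(boolean hasLogicalTypeAnnotation) {
        return hasLogicalTypeAnnotation ? new NoLongConverter() : new LongConverter();
    }

    public static void main(String[] args) {
        long sampleMillis = 1_000L; // arbitrary int64 value standing in for a timestamp

        // Without the annotation the value is accepted.
        choose(false).addLong(sampleMillis);
        System.out.println("plain int64: value accepted");

        // With the annotation the selected converter cannot take a long.
        try {
            choose(true).addLong(sampleMillis);
        } catch (UnsupportedOperationException e) {
            System.out.println("annotated int64: rejected by " + e.getMessage());
        }
    }
}
```

In this model, a fix amounts to making the converter chosen for the annotated column accept long values when the Hive column type is BIGINT; whether the actual PR takes that approach is something to confirm in the review.
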
> Hive cannot read parquet files with int64 (TIMESTAMP_MILLIS)
> ------------------------------------------------------------
>
>                 Key: HIVE-26612
>                 URL: https://issues.apache.org/jira/browse/HIVE-26612
>             Project: Hive
>          Issue Type: Bug
>          Components: Database/Schema
>            Reporter: Steve Carlin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If a parquet file has a Type of "int64 eventtime (TIMESTAMP(MILLIS,true))",
> the following error is produced:
> {noformat}
> java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/xxxx/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>         at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:213)
>         at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:98)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
> Caused by: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/xxxx/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>         at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:624)
>         at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:531)
>         at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:197)
>         ... 55 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/home/stamatis/Projects/Apache/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
>         at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:87)
>         at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>         at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:771)
>         at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335)
>         at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:562)
>         ... 57 more
> Caused by: java.lang.UnsupportedOperationException: org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$10$1
>         at org.apache.parquet.io.api.PrimitiveConverter.addLong(PrimitiveConverter.java:105)
>         at org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:301)
>         at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:410)
>         at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>         at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
>         ... 63 more
> {noformat}
> The parquet file can be created with the following steps (through spark):
> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
> [1]
> val df = Seq(
>   (1, Timestamp.valueOf("2014-01-01 23:00:01")),
>   (1, Timestamp.valueOf("2014-11-30 12:40:32")),
>   (2, Timestamp.valueOf("2016-12-29 09:54:00")),
>   (2, Timestamp.valueOf("2016-05-09 10:12:43"))
> ).toDF("typeid","eventtime")
> [2]
> [root@c4839-node3 test_parquet2]# parquet-tools schema part-00001-6c90b794-90b9-4cc0-afc5-2e49a4e96bad-c000.snappy.parquet
> message spark_schema {
>   required int32 typeid;
>   optional int64 eventtime (TIMESTAMP(MILLIS,true));
> }
> [3]
> [root@c4839-node3 test_parquet1]# parquet-tools schema part-00001-cb1aeebb-ec87-4273-82ec-911c4fb605b6-c000.snappy.parquet
> message spark_schema {
>   required int32 typeid;
>   optional int96 eventtime;
> }

--
This message was sent by Atlassian Jira
(v8.20.10#820010)