[ https://issues.apache.org/jira/browse/HIVE-26612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620440#comment-17620440 ]

Stamatis Zampetakis commented on HIVE-26612:
--------------------------------------------

It is not my intention to prove the customer right or wrong, but rather to 
clarify whether there is a bug and where it is. When multiple projects are 
involved in a problem (in this case Spark and Hive), it is important to 
understand which side is causing it. If there is a change in the way Spark 
writes the Parquet file, then that could also be what causes the exceptions 
mentioned here.
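
For triage, one way to check which writer produced a given file is to read its 
footer metadata directly. A minimal sketch with the parquet-mr API (the path 
below is hypothetical; point it at the actual file):
{noformat}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open the Parquet footer and print the writer identity and schema.
val file = HadoopInputFile.fromPath(
  new Path("/tmp/timestamp_as_bigint.parquet"), new Configuration())
val reader = ParquetFileReader.open(file)
try {
  val meta = reader.getFooter.getFileMetaData
  println(meta.getCreatedBy) // which writer/version produced the file
  println(meta.getSchema)    // physical and logical types as written
} finally reader.close()
{noformat}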

The Hive Parquet documentation 
(https://cwiki.apache.org/confluence/display/Hive/Parquet) is very sketchy, 
leaving a lot of open questions about what exactly is supported and how things 
are supposed to work. This ticket, as well as HIVE-23345, presents Hive's 
inability to read a Parquet TIMESTAMP into a Hive BIGINT as a Hive bug, but 
there were no tests and no documentation implying that this is possible. In 
such cases, there is a fine line between a bug and a feature request.
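
To make the ask concrete: a Parquet TIMESTAMP_MILLIS value is physically an 
int64 holding epoch milliseconds, so reading it into a Hive BIGINT would amount 
to surfacing that raw long. A small illustration (the printed value depends on 
the JVM time zone):
{noformat}
import java.sql.Timestamp

// TIMESTAMP(MILLIS,true) stores int64 epoch milliseconds under the hood.
val ts = Timestamp.valueOf("2014-01-01 23:00:01")
println(ts.getTime) // 1388617201000 when the JVM time zone is UTC
{noformat}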

Another reason I wanted to know which commit caused the breaking change in 
Hive is to understand whether it was intentional.

Running git bisect with the test case in the PR shows that the Hive commit 
which broke this use case is HIVE-21215. Note that if the logical type were 
missing from the file metadata, things would work as before without problems.

Now I have a better picture of what is happening, and it seems reasonable to 
fix this; I will try to have a look at the PR in the next few days.

> Hive cannot read parquet files with int64 (TIMESTAMP_MILLIS)
> ------------------------------------------------------------
>
>                 Key: HIVE-26612
>                 URL: https://issues.apache.org/jira/browse/HIVE-26612
>             Project: Hive
>          Issue Type: Bug
>          Components: Database/Schema
>            Reporter: Steve Carlin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If a Parquet file has a field declared as "int64 eventtime (TIMESTAMP(MILLIS,true))", 
> the following error is produced:
> {noformat}
> java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/xxxx/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>       at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:213)
>       at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:98)
>       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
> Caused by: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/xxxx/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>       at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:624)
>       at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:531)
>       at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:197)
>       ... 55 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/home/stamatis/Projects/Apache/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
>       at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>       at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:87)
>       at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>       at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:771)
>       at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335)
>       at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:562)
>       ... 57 more
> Caused by: java.lang.UnsupportedOperationException: org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$10$1
>       at org.apache.parquet.io.api.PrimitiveConverter.addLong(PrimitiveConverter.java:105)
>       at org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:301)
>       at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:410)
>       at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>       at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
>       ... 63 more
> {noformat}
> The parquet file can be created with the following steps (through Spark):
> {noformat}
> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
> {noformat}
> [1]
> {noformat}
> import java.sql.Timestamp // needed for Timestamp.valueOf in spark-shell
> val df = Seq(
>   (1, Timestamp.valueOf("2014-01-01 23:00:01")),
>   (1, Timestamp.valueOf("2014-11-30 12:40:32")),
>   (2, Timestamp.valueOf("2016-12-29 09:54:00")),
>   (2, Timestamp.valueOf("2016-05-09 10:12:43"))
> ).toDF("typeid", "eventtime")
> {noformat}
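> The write call itself is not shown above; a minimal sketch (the output path 
> is hypothetical):
> {noformat}
> df.write.mode("overwrite").parquet("/tmp/parquet_format_ts_as_bigint")
> {noformat}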
> [2]
> {noformat}
> [root@c4839-node3 test_parquet2]# parquet-tools schema part-00001-6c90b794-90b9-4cc0-afc5-2e49a4e96bad-c000.snappy.parquet
> message spark_schema {
>   required int32 typeid;
>   optional int64 eventtime (TIMESTAMP(MILLIS,true));
> }
> {noformat}
> [3]
> {noformat}
> [root@c4839-node3 test_parquet1]# parquet-tools schema part-00001-cb1aeebb-ec87-4273-82ec-911c4fb605b6-c000.snappy.parquet
> message spark_schema {
>   required int32 typeid;
>   optional int96 eventtime;
> }
> {noformat}
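> Presumably the int96 file in [3] comes from the same dataframe written with 
> Spark's default timestamp type; a sketch of the toggle (output path is 
> hypothetical):
> {noformat}
> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
> df.write.mode("overwrite").parquet("/tmp/parquet_ts_as_int96")
> {noformat}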



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
