xglv1985 opened a new issue #2812:
URL: https://github.com/apache/hudi/issues/2812
When I ran a Spark 2.4 job to incrementally query a MOR (merge-on-read) table, the job failed with the following errors:
```
2021-04-13 12:39:54 [Executor task launch worker for task 215] ERROR [Executor:91]: Exception in task 160.2 in stage 0.0 (TID 215)
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: null
	at org.apache.parquet.format.Util.read(Util.java:216)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:936)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:217)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException
	at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at shaded.parquet.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readBinary(TCompactProtocol.java:709)
	at org.apache.parquet.format.InterningProtocol.readBinary(InterningProtocol.java:223)
	at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:102)
	at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:138)
	at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:130)
	at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1047)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
	at org.apache.parquet.format.Util.read(Util.java:213)
	... 25 more
```
and
```
2021-04-13 12:40:07 [Executor task launch worker for task 1113] ERROR [Executor:91]: Exception in task 206.4 in stage 0.0 (TID 1113)
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@ec35f50
	at org.apache.parquet.format.Util.read(Util.java:216)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:936)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:217)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@ec35f50
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1055)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
	at org.apache.parquet.format.Util.read(Util.java:213)
```
My Scala code is as follows (imports shown for context; `Constants` is our own config-key holder, and the option keys come from Hudi's `DataSourceReadOptions`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceReadOptions._

def main(args: Array[String]) {
  val spark = SparkSession.builder().getOrCreate()
  val conf = spark.conf
  val input = conf.get(Constants.SPARK_INPUT)
  val sql = conf.get(Constants.SPARK_SQL_QUERY)
  val df = spark.read.format("hudi").
    option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
    option(BEGIN_INSTANTTIME_OPT_KEY, "20210413081500").
    load(input)
  df.createOrReplaceTempView("hudi_materials_incremental")
  spark.sql(sql).show()
  df.printSchema()
}
```
To rule out corruption in the parquet files themselves, I ran an experiment: I changed `spark.read.format("hudi")` to `spark.read.format("parquet")`, removed the two lines with the hudi options, and re-ran the job. I got no error at all.
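For reference, the sanity-check variant can be sketched roughly as below (an illustration of the experiment, not the exact job; `input` and `sql` come from the same `Constants` config keys as in the main code):

```scala
// Read the same base files directly as plain parquet, bypassing Hudi's
// merge-on-read path, to check whether the parquet files themselves are bad.
val plainDf = spark.read.format("parquet").load(input)

// Run the same query pattern that failed under format("hudi");
// this version completed without any PageHeader error.
plainDf.createOrReplaceTempView("hudi_materials_incremental")
spark.sql(sql).show()
```

Since the plain-parquet read succeeds over the same files, the failure seems specific to the Hudi incremental read path rather than to the underlying data.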
So what actually causes the error, and what should I change in my code or configuration to eliminate it? Thanks!
**Environment Description**
* Hudi version : 0.8.0
* Spark version : 2.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) :
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]