WilliamWhispell commented on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-830465596


   Then, following on from the above, if we start pyspark with --conf
'spark.hadoop.parquet.avro.write-old-list-structure=false'

   and write the Hudi table with the same steps as above, we get:
   >>> spark_df.write.format("org.apache.hudi") \
   ...     .mode("append") \
   ...     .option("hoodie.table.name", "apple") \
   ...     .option("hoodie.datasource.write.precombine.field", "hudi_key") \
   ...     .option("hoodie.datasource.write.recordkey.field", "hudi_key") \
   ...     .option("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
   ...     .option("parquet.avro.write-old-list-structure", "false") \
   ...     .option("hoodie.parquet.avro.write-old-list-structure", "false") \
   ...     .save("/home/jovyan/apple.parquet")
   21/04/30 23:54:57 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=5 partitionPath=default}, currentLocation='null', newLocation='null'}
   java.lang.ClassCastException: repeated binary array (UTF8) is not a group
           at org.apache.parquet.schema.Type.asGroupType(Type.java:207)
           at org.apache.parquet.avro.AvroWriteSupport$ThreeLevelListWriter.writeCollection(AvroWriteSupport.java:610)
           at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:395)
           at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
           at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
           at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
           at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
           at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
           at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
           at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:83)
           at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:118)
           at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:163)
           at org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:96)
           at org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40)
           at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
           at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
           at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
           at java.base/java.lang.Thread.run(Thread.java:834)
   21/04/30 23:54:57 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=1 partitionPath=default}, currentLocation='null', newLocation='null'}
   java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
           at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.startGroup(MessageColumnIO.java:395)
           at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:393)
           at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
           at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
           at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
   ....
   
   
   So it looks like, even without nulls in the array, we cannot get the 3-level list format written?
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
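
   For context, a minimal sketch of the two Parquet list encodings that the linked backward-compatibility rules distinguish (the field name `my_list` is hypothetical, not from the table above). The `ClassCastException` text matches the old 2-level shape: the writer's `ThreeLevelListWriter` calls `asGroupType()` on the repeated field, which in the 2-level layout is a primitive, not a group.

```python
# Illustrative only: Parquet schema fragments for the two LIST encodings
# described in the parquet-format backward-compatibility rules.
# The field names here are made up for the example.

# Old 2-level layout, produced when parquet.avro.write-old-list-structure=true
# (the parquet-avro default). The repeated field IS the element, so list
# elements can never be null.
two_level = """\
optional group my_list (LIST) {
  repeated binary array (UTF8);
}"""

# New 3-level layout, produced when the flag is false. The repeated field
# is a group wrapping an optional element, so null elements are allowed.
three_level = """\
optional group my_list (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}"""

# "repeated binary array (UTF8) is not a group" from the stack trace is the
# writer hitting the 2-level repeated primitive while expecting the 3-level
# repeated group.
assert "repeated binary array (UTF8)" in two_level
assert "repeated group list" in three_level
```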
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

