Shawn Chang created HUDI-9119:
---------------------------------

             Summary: Hudi 1.0.1 cannot write MOR tables
                 Key: HUDI-9119
                 URL: https://issues.apache.org/jira/browse/HUDI-9119
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Shawn Chang


When testing Hudi 1.0.1 on EMR 7.8, writes fail with the exception below:
{code:java}
Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scanInternalV1(AbstractHoodieLogRecordScanner.java:388)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scanInternal(AbstractHoodieLogRecordScanner.java:250)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.scanByKeyPrefixes(HoodieMergedLogRecordScanner.java:196)
  at org.apache.hudi.metadata.HoodieMetadataLogRecordReader.getRecordsByKeyPrefixes(HoodieMetadataLogRecordReader.java:87)
  at org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:379)
  at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeyPrefixes$7539c171$1(HoodieBackedTableMetadata.java:234)
  at org.apache.hudi.common.function.FunctionWrapper.lambda$throwingMapWrapper$0(FunctionWrapper.java:38)
  ... 39 more
Caused by: java.lang.ClassCastException: class org.apache.avro.generic.GenericData$Record cannot be cast to class org.apache.hudi.avro.model.HoodieDeleteRecordList (org.apache.avro.generic.GenericData$Record is in unnamed module of loader 'app'; org.apache.hudi.avro.model.HoodieDeleteRecordList is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b2ea718)
  at org.apache.hudi.common.table.log.block.HoodieDeleteBlock.deserialize(HoodieDeleteBlock.java:169)
  at org.apache.hudi.common.table.log.block.HoodieDeleteBlock.getRecordsToDelete(HoodieDeleteBlock.java:124)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:678)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scanInternalV1(AbstractHoodieLogRecordScanner.java:378)
  ... 45 more
{code}
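Note that the ClassCastException names two different loaders ('app' vs. Spark's MutableURLClassLoader), so this looks like a class-identity problem rather than a schema problem: the same class name defined by two classloaders yields two distinct classes that cannot be cast to each other. The following is a minimal standalone sketch of that mechanism (not Hudi code; `Payload` is a hypothetical stand-in for a class like HoodieDeleteRecordList):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class LoaderCastDemo {

    // Hypothetical stand-in for a class visible through two loaders.
    public static class Payload {
        public Payload() {}
    }

    // A child loader that re-defines Payload from its .class bytes instead of
    // delegating, giving the class a second identity in the JVM.
    static class IsolatingLoader extends ClassLoader {
        IsolatingLoader() {
            super(LoaderCastDemo.class.getClassLoader());
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(Payload.class.getName())) {
                return super.loadClass(name, resolve); // delegate everything else
            }
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                byte[] bytes = out.toByteArray();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    // Returns true when the cast fails even though the class names are identical.
    public static boolean castFailsAcrossLoaders() {
        try {
            Object o = new IsolatingLoader()
                    .loadClass(Payload.class.getName())
                    .getDeclaredConstructor()
                    .newInstance();
            Payload ignored = (Payload) o;
            return false;
        } catch (ClassCastException e) {
            // The message reads like the one above: same name, two defining loaders.
            return true;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cast fails across loaders: " + castFailsAcrossLoaders());
    }
}
```

In the Hudi case this suggests the Avro-generated model classes resolve through a different loader than the code performing the cast; the sketch only illustrates why such a cast must fail.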
Reproduction steps:
 # Start an EMR 7.8 cluster
 # Start spark-shell with the command below:
{code:java}
spark-shell \
  --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
{code}
 # Run the script below:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

val df1 = Seq(
  (100, "2015-01-01", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "2015-01-01", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (102, "2015-01-01", "event_name_345", "2015-01-01T13:51:40.417052Z", "type3"),
  (103, "2015-01-01", "event_name_234", "2015-01-01T13:51:40.519832Z", "type4"),
  (104, "2015-01-01", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "2015-01-01", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2"),
  (106, "2015-01-01", "event_name_890", "2015-01-01T13:51:44.735360Z", "type3"),
  (107, "2015-01-01", "event_name_944", "2015-01-01T13:51:45.019544Z", "type4"),
  (108, "2015-01-01", "event_name_456", "2015-01-01T13:51:45.208007Z", "type1"),
  (109, "2015-01-01", "event_name_567", "2015-01-01T13:51:45.369689Z", "type2"),
  (110, "2015-01-01", "event_name_789", "2015-01-01T12:15:05.664947Z", "type3"),
  (111, "2015-01-01", "event_name_322", "2015-01-01T13:51:47.388239Z", "type4")
).toDF("event_id", "event_date", "event_name", "event_ts", "event_type")

val r = scala.util.Random
val num = r.nextInt(99999)
val tableName = "yxchang_hudi_cow_simple_14_" + num
val tablePath = "s3://<yourbucket>/hudi10/" + tableName + "/"

df1.write.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.operation", "insert") // use insert
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "event_id,event_date")
  .option("hoodie.datasource.write.partitionpath.field", "event_type")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.meta.sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.table", tableName)
  .option("hoodie.datasource.hive_sync.partition_fields", "event_type")
  .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}
Note that the script above writes a COW table with the metadata table (MDT) enabled, and it also reproduces the issue, so the problem is not limited to MOR data tables.

 

Additional context:
 # This exception looks the same as [https://github.com/apache/hudi/issues/10609]
 # The same script runs without issue on OSS Hudi 1.0.0
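For anyone triaging this, a quick first check is which artifact each side of the failing cast was actually served from, since the exception reports two different loaders. Below is a small, hedged helper sketch (the Hudi class names in the comment are the ones from the stack trace; the stdlib class in {{main}} is just a runnable stand-in):

```java
import java.security.CodeSource;

public class WhichJar {

    // Report where a class was loaded from; JDK platform classes
    // typically report no code source.
    public static String locationOf(String className) throws ClassNotFoundException {
        Class<?> c = Class.forName(className);
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "<platform/bootstrap>"
                : src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Stdlib stand-in here; in spark-shell one would instead ask about
        // "org.apache.hudi.avro.model.HoodieDeleteRecordList" and
        // "org.apache.avro.generic.GenericData" to see which jars serve them.
        System.out.println(locationOf("java.lang.String"));
    }
}
```

If the two classes come back from different jars (e.g. the EMR Spark distribution vs. the `--packages` bundle), that would point to a packaging/classloader conflict rather than corrupted log blocks.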



--
This message was sent by Atlassian Jira
(v8.20.10#820010)