ravs11 commented on issue #4609: URL: https://github.com/apache/hudi/issues/4609#issuecomment-1013871440
Hi @xiarixiaoyao, let me try to share some more details on this.

**Hudi Table Schema**

```
CREATE TABLE dev.hudi_z_order_test (
  product_id INT,
  product_name STRING,
  product_category STRING,
  create_time BIGINT,
  utc_date STRING)
USING hudi
OPTIONS (
  `serialization.format` '1',
  path 'hdfs://R2/tmp/ravs11/hudi_z_order'
)
PARTITIONED BY (utc_date)
```

**Hudi Data Ingestion Logic**

```
val df = spark.read.parquet(s"/tmp/ravs11/hudi_input/utc_date=$utcDate")
val savePath = "hdfs://R2/tmp/ravs11/hudi_z_order"
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "dev.hudi_z_order_test")
  .option("hoodie.datasource.write.table.name", "dev.hudi_z_order_test")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.sql.insert.mode", "non-strict")
  .option("hoodie.datasource.write.precombine.field", "create_time")
  .option("hoodie.datasource.write.recordkey.field", "product_id")
  .option("hoodie.datasource.write.partitionpath.field", "utc_date")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .option("hoodie.parquet.compression.codec", "zstd")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "1")
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
  .option("hoodie.clustering.plan.strategy.small.file.limit", "536870912")
  .option("hoodie.clustering.plan.strategy.sort.columns", "product_name,product_category")
  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString)
  .option("hoodie.layout.optimize.enable", "true")
  .option("hoodie.layout.optimize.strategy", "z-order")
  .mode(SaveMode.Append)
  .save(savePath)
```

**Steps to reproduce**

1. Run the above Spark job for utc_date=2021-12-24. Content of the input parquet file (/tmp/ravs11/hudi_input/utc_date=2021-12-24/xxx.parquet):

   ```
   +----------+------------+----------------+-------------+----------+
   |product_id|product_name|product_category|create_time  |utc_date  |
   +----------+------------+----------------+-------------+----------+
   |123       |laptop      |electronics     |1671881778000|2021-12-24|
   +----------+------------+----------------+-------------+----------+
   ```

   This job completes **successfully**.

2. Run the above Spark job for utc_date=2021-12-25. Content of the input parquet file (/tmp/ravs11/hudi_input/utc_date=2021-12-25/yyy.parquet):

   ```
   +----------+------------+----------------+-------------+----------+
   |product_id|product_name|product_category|create_time  |utc_date  |
   +----------+------------+----------------+-------------+----------+
   |456       |tshirt      |mens wear       |1671968178000|2021-12-25|
   +----------+------------+----------------+-------------+----------+
   ```

   This job **fails** with the exception that was mentioned earlier.
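In case it helps reproduce this end to end, here is a minimal sketch of how the two single-row input files above could be materialized. This is hypothetical (not how I originally generated the data) and assumes a spark-shell session with an active `spark` SparkSession; the rows and paths are taken verbatim from the tables in steps 1 and 2:

```
import spark.implicits._

// Row for step 1 (utc_date=2021-12-24); this partition ingests fine.
Seq((123, "laptop", "electronics", 1671881778000L, "2021-12-24"))
  .toDF("product_id", "product_name", "product_category", "create_time", "utc_date")
  .write.parquet("/tmp/ravs11/hudi_input/utc_date=2021-12-24")

// Row for step 2 (utc_date=2021-12-25); ingesting this second partition
// is what triggers the failure.
Seq((456, "tshirt", "mens wear", 1671968178000L, "2021-12-25"))
  .toDF("product_id", "product_name", "product_category", "create_time", "utc_date")
  .write.parquet("/tmp/ravs11/hudi_input/utc_date=2021-12-25")
```

After that, running the ingestion job above once per utc_date should reproduce the behavior: the first run succeeds and the second fails during inline clustering.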