ravs11 commented on issue #4609:
URL: https://github.com/apache/hudi/issues/4609#issuecomment-1013871440


   Hi @xiarixiaoyao, let me share some more details on this.
   
   **Hudi Table Schema**
   ```
   CREATE TABLE dev.hudi_z_order_test (
   product_id INT, 
   product_name STRING, 
   product_category STRING, 
   create_time BIGINT,
   utc_date STRING)
   USING hudi
   OPTIONS (
     `serialization.format` '1',
     path 'hdfs://R2/tmp/ravs11/hudi_z_order'
   )
   PARTITIONED BY (utc_date)
   ```
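   
   (Side note: to confirm the catalog entry resolves to the same HDFS path the writer below uses, something like the following sketch works; this assumes a Spark session with access to the `dev` database and is not output from the original run.)
   ```
   // Show the table's storage location from the catalog metadata.
   spark.sql("DESCRIBE FORMATTED dev.hudi_z_order_test")
     .filter("col_name = 'Location'")
     .show(false)
   ```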
   **Hudi Data Ingestion Logic**
   ```
   val df = spark.read.parquet(s"/tmp/ravs11/hudi_input/utc_date=$utcDate")
   val savePath = s"hdfs://R2/tmp/ravs11/hudi_z_order"
   df.write.format("org.apache.hudi")
         .option("hoodie.table.name", s"dev.hudi_z_order_test")
         .option("hoodie.datasource.write.table.name", s"dev.hudi_z_order_test")
         .option("hoodie.datasource.write.operation", "bulk_insert")
         .option("hoodie.sql.insert.mode", "non-strict")
         .option("hoodie.datasource.write.precombine.field", "create_time")
         .option("hoodie.datasource.write.recordkey.field", "product_id")
         .option("hoodie.datasource.write.partitionpath.field", "utc_date")
         .option("hoodie.datasource.write.keygenerator.class", 
"org.apache.hudi.keygen.SimpleKeyGenerator")
         .option("hoodie.datasource.write.hive_style_partitioning", "true")
         .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
         .option("hoodie.bulkinsert.sort.mode", "NONE")
         .option("hoodie.parquet.compression.codec", "zstd")
         .option("hoodie.clustering.inline", "true")
         .option("hoodie.clustering.inline.max.commits", "1")
         .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
"1073741824")
         .option("hoodie.clustering.plan.strategy.small.file.limit", 
"536870912")
         .option("hoodie.clustering.plan.strategy.sort.columns", 
"product_name,product_category")
         .option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
Long.MaxValue.toString)
         .option("hoodie.layout.optimize.enable", "true")
         .option("hoodie.layout.optimize.strategy", "z-order")
         .mode(SaveMode.Append).save(savePath)
   ```
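   
   Note that with `hoodie.clustering.inline.max.commits` set to 1, inline clustering (z-ordering on product_name,product_category) should be triggered after every commit. A minimal sketch to check the timeline for the resulting replacecommit instants, using plain Hadoop FS calls against the paths above:
   ```
   import org.apache.hadoop.fs.Path
   
   // List completed clustering (replacecommit) instants on the Hudi timeline.
   val hoodiePath = new Path(s"$savePath/.hoodie")
   val fs = hoodiePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
   fs.listStatus(hoodiePath).map(_.getPath.getName)
     .filter(_.endsWith(".replacecommit"))
     .foreach(println)
   ```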
   **Steps to reproduce**
   
   1. Run the above Spark job for utc_date=2021-12-24.
   Content of the input Parquet file (/tmp/ravs11/hudi_input/utc_date=2021-12-24/xxx.parquet); a sketch for generating equivalent input is included after these steps:
   ```
   +----------+------------+----------------+-------------+----------+
   |product_id|product_name|product_category|create_time  |utc_date  |
   +----------+------------+----------------+-------------+----------+
   |123       |laptop      |electronics     |1671881778000|2021-12-24|
   +----------+------------+----------------+-------------+----------+
   ```
   
   This job completes **successfully**.
   
   2. Run the above Spark job for utc_date=2021-12-25.
   Content of the input Parquet file (/tmp/ravs11/hudi_input/utc_date=2021-12-25/yyy.parquet):
   ```
   +----------+------------+----------------+-------------+----------+
   |product_id|product_name|product_category|create_time  |utc_date  |
   +----------+------------+----------------+-------------+----------+
   |456       |tshirt      |mens wear       |1671968178000|2021-12-25|
   +----------+------------+----------------+-------------+----------+
   ```
   
   This job **fails** with the exception mentioned earlier.
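   
   For completeness, the steps above read pre-existing Parquet input; here is a minimal sketch that would produce an equivalent single-row input file for step 1 (column names and values taken from the table above; step 2 is analogous with the second row and date):
   ```
   import spark.implicits._
   
   // Hypothetical input generator; writes to the path the ingestion job reads.
   Seq((123, "laptop", "electronics", 1671881778000L, "2021-12-24"))
     .toDF("product_id", "product_name", "product_category", "create_time", "utc_date")
     .write.mode("overwrite")
     .parquet("/tmp/ravs11/hudi_input/utc_date=2021-12-24")
   ```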

