ravs11 commented on issue #4609: URL: https://github.com/apache/hudi/issues/4609#issuecomment-1013871440
Hi @xiarixiaoyao, let me try to share some more details on this.

**Hudi Table Schema**

```
CREATE TABLE dev.hudi_z_order_test (
  product_id INT,
  product_name STRING,
  product_category STRING,
  create_time BIGINT,
  utc_date STRING)
USING hudi
OPTIONS (
  `serialization.format` '1',
  path 'hdfs://R2/tmp/ravs11/hudi_z_order'
)
PARTITIONED BY (utc_date)
```

**Hudi Data Ingestion Logic**

```
val df = spark.read.parquet(s"/tmp/ravs11/hudi_input/utc_date=$utcDate")
val savePath = "hdfs://R2/tmp/ravs11/hudi_z_order"
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "dev.hudi_z_order_test")
  .option("hoodie.datasource.write.table.name", "dev.hudi_z_order_test")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.sql.insert.mode", "non-strict")
  .option("hoodie.datasource.write.precombine.field", "create_time")
  .option("hoodie.datasource.write.recordkey.field", "product_id")
  .option("hoodie.datasource.write.partitionpath.field", "utc_date")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .option("hoodie.parquet.compression.codec", "zstd")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "1")
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
  .option("hoodie.clustering.plan.strategy.small.file.limit", "536870912")
  .option("hoodie.clustering.plan.strategy.sort.columns", "product_name,product_category")
  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString)
  .option("hoodie.layout.optimize.enable", "true")
  .option("hoodie.layout.optimize.strategy", "z-order")
  .mode(SaveMode.Append)
  .save(savePath)
```

**Steps to reproduce**

1. Run the above Spark job for utc_date=2021-12-24. Content of the input parquet file (/tmp/ravs11/hudi_input/utc_date=2021-12-24/xxx.parquet):

   ```
   +----------+------------+----------------+-------------+----------+
   |product_id|product_name|product_category|create_time  |utc_date  |
   +----------+------------+----------------+-------------+----------+
   |123       |laptop      |electronics     |1671881778000|2021-12-24|
   +----------+------------+----------------+-------------+----------+
   ```

   This job completes **successfully**.

2. Run the above Spark job for utc_date=2021-12-25. Content of the input parquet file (/tmp/ravs11/hudi_input/utc_date=2021-12-25/yyy.parquet):

   ```
   +----------+------------+----------------+-------------+----------+
   |product_id|product_name|product_category|create_time  |utc_date  |
   +----------+------------+----------------+-------------+----------+
   |456       |tshirt      |mens wear       |1671968178000|2021-12-25|
   +----------+------------+----------------+-------------+----------+
   ```

   This job **fails** with the exception that was mentioned earlier.
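In case it helps reproduce this end to end, here is a minimal sketch of how the two single-row input files above could be materialized. This is hypothetical (not how I originally generated the data) and assumes a spark-shell session with an active `spark` SparkSession; the rows and paths are taken verbatim from the tables in steps 1 and 2:

```
import spark.implicits._

// Row for step 1 (utc_date=2021-12-24); this partition ingests fine.
Seq((123, "laptop", "electronics", 1671881778000L, "2021-12-24"))
  .toDF("product_id", "product_name", "product_category", "create_time", "utc_date")
  .write.parquet("/tmp/ravs11/hudi_input/utc_date=2021-12-24")

// Row for step 2 (utc_date=2021-12-25); ingesting this second partition
// is what triggers the failure.
Seq((456, "tshirt", "mens wear", 1671968178000L, "2021-12-25"))
  .toDF("product_id", "product_name", "product_category", "create_time", "utc_date")
  .write.parquet("/tmp/ravs11/hudi_input/utc_date=2021-12-25")
```

After that, running the ingestion job above once per utc_date should reproduce the behavior: the first run succeeds and the second fails during inline clustering.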