boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1257404461
### Test1
4 flat columns
```bash
--num-executors 64 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 20g \  # 10g in the row-enabled run
--executor-cores 1 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
$PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
--mode scheduleAndExecute \
--base-path $TABLEPATH \
--table-name $TABLENAME \
--spark-memory 20g \  # 10g in the row-enabled run
--parallelism 64 \
--hoodie-conf hoodie.clustering.async.enabled=true \
--hoodie-conf hoodie.clustering.async.max.commits=0 \
--hoodie-conf hoodie.clustering.plan.strategy.max.bytes.per.group=5368709120 \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=6442450944 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=10000000
```
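For context, the flag list above is only a fragment; it is presumably passed to `spark-submit`. A full invocation might look like the sketch below. The `--master`/`--deploy-mode` values and the `hudi-spark-bundle` jar on `--jars` are assumptions (the slim utilities bundle is normally paired with the matching Spark bundle on the classpath):

```bash
# Hypothetical full invocation assembled from the flags above.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars $PWD/../hudi-spark-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --num-executors 64 \
  --driver-memory 20g \
  --driver-cores 1 \
  --executor-memory 20g \
  --executor-cores 1 \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  $PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --mode scheduleAndExecute \
  --base-path $TABLEPATH \
  --table-name $TABLENAME \
  --spark-memory 20g \
  --parallelism 64 \
  --hoodie-conf hoodie.clustering.async.enabled=true
```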
Row enabled | Partition hour | Total size | File count | Runtime
-- | -- | -- | -- | --
true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 753s
false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K | 1008s
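For reference, the Test1 numbers above work out to roughly a 25% reduction in end-to-end runtime with the row writer enabled (the two partitions differ somewhat in size, so this is only indicative):

```bash
# Runtime reduction: (1008 - 753) / 1008
awk 'BEGIN { printf "%.1f%%\n", (1008 - 753) / 1008 * 100 }'
# → 25.3%
```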
### Test2
23 columns, 9 nested columns, using z-order
```bash
--conf 'spark.sql.parquet.columnarReaderBatchSize=2048' \
--conf 'spark.yarn.maxAppAttempts=1' \
--num-executors 32 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 30g \
--executor-cores 1 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
$PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
--mode scheduleAndExecute \
--base-path $TABLEPATH \
--table-name $TABLENAME \
--spark-memory 30g \
--parallelism 32 \
--hoodie-conf hoodie.clustering.async.enabled=true \
--hoodie-conf hoodie.clustering.async.max.commits=0 \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=209715200 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
--hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
--hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
--hoodie-conf hoodie.layout.optimize.enable=true \
--hoodie-conf hoodie.layout.optimize.strategy=z-order \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=applicationId,sparkUser
```
Row enabled | Partition day | Total size | File count | Runtime
-- | -- | -- | -- | --
true | 2022-09-19 | 70.9 G | 7.5 K | 11h 7min
false | 2022-09-20 | 69.7 G | 7.3 K | 11h 33min
Computing performance improved by 20% to 30%. The bottleneck of this job is writing data; both runs spend approximately 10 hours in the write stage.
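As a rough consistency check (assuming the ~10 h, i.e. ~600 min, write stage applies to both runs): the totals are 11h 7min = 667 min and 11h 33min = 693 min, so the non-write portion shrinks from about 93 min to 67 min, which lines up with the stated 20-30% compute improvement:

```bash
# Compute-side improvement, assuming ~600 min of each run is the write stage:
# (693 - 667) / (693 - 600)
awk 'BEGIN { printf "%.0f%%\n", (693 - 667) / (693 - 600) * 100 }'
# → 28%
```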