boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1257404461
### Test1
4 flat columns
```bash
--num-executors 64 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 20g \  # 10g in the row-enabled run
--executor-cores 1 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
$PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
--mode scheduleAndExecute \
--base-path $TABLEPATH \
--table-name $TABLENAME \
--spark-memory 20g \  # 10g in the row-enabled run
--parallelism 64 \
--hoodie-conf hoodie.clustering.async.enabled=true \
--hoodie-conf hoodie.clustering.async.max.commits=0 \
--hoodie-conf hoodie.clustering.plan.strategy.max.bytes.per.group=5368709120 \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=6442450944 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=10000000
```
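For context, the flag list above is only a fragment; it is presumably passed to `spark-submit`. A full invocation might look like the sketch below. The `--master`/`--deploy-mode` values and the `hudi-spark-bundle` jar on `--jars` are assumptions (the slim utilities bundle is normally paired with the matching Spark bundle on the classpath):

```bash
# Hypothetical full invocation assembled from the flags above.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars $PWD/../hudi-spark-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --num-executors 64 \
  --driver-memory 20g \
  --driver-cores 1 \
  --executor-memory 20g \
  --executor-cores 1 \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  $PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --mode scheduleAndExecute \
  --base-path $TABLEPATH \
  --table-name $TABLENAME \
  --spark-memory 20g \
  --parallelism 64 \
  --hoodie-conf hoodie.clustering.async.enabled=true
```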
Row enabled | Partition hour | Total size | File count | Runtime
-- | -- | -- | -- | --
true | dt=2022-09-22/hh=14 | 233.9 G | 1.3 K | 753s
false | dt=2022-09-22/hh=22 | 209.6 G | 1.3 K | 1008s
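For reference, the Test1 numbers above work out to roughly a 25% reduction in end-to-end runtime with the row writer enabled (the two partitions differ somewhat in size, so this is only indicative):

```bash
# Runtime reduction: (1008 - 753) / 1008
awk 'BEGIN { printf "%.1f%%\n", (1008 - 753) / 1008 * 100 }'
# → 25.3%
```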
### Test2
23 columns, 9 nested columns, using z-order
```bash
--conf 'spark.sql.parquet.columnarReaderBatchSize=2048' \
--conf 'spark.yarn.maxAppAttempts=1' \
--num-executors 32 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 30g \
--executor-cores 1 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
$PWD/../hudi-utilities-slim-bundle_2.12-0.13.0-SNAPSHOT.jar \
--mode scheduleAndExecute \
--base-path $TABLEPATH \
--table-name $TABLENAME \
--spark-memory 30g \
--parallelism 32 \
--hoodie-conf hoodie.clustering.async.enabled=true \
--hoodie-conf hoodie.clustering.async.max.commits=0 \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=209715200 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=1073741824 \
--hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
--hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
--hoodie-conf hoodie.layout.optimize.enable=true \
--hoodie-conf hoodie.layout.optimize.strategy=z-order \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=applicationId,sparkUser
```
Row enabled | Partition day | Total size | File count | Runtime
-- | -- | -- | -- | --
true | 2022-09-19 | 70.9 G | 7.5 K | 11h 7min
false | 2022-09-20 | 69.7 G | 7.3 K | 11h 33min
Computing performance improved by 20% to 30%. The bottleneck of this job is writing data; both runs spend approximately 10 hours in the write stage.
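As a rough consistency check (assuming the ~10 h, i.e. ~600 min, write stage applies to both runs): the totals are 11h 7min = 667 min and 11h 33min = 693 min, so the non-write portion shrinks from about 93 min to 67 min, which lines up with the stated 20-30% compute improvement:

```bash
# Compute-side improvement, assuming ~600 min of each run is the write stage:
# (693 - 667) / (693 - 600)
awk 'BEGIN { printf "%.0f%%\n", (693 - 667) / (693 - 600) * 100 }'
# → 28%
```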