suryaprasanna commented on issue #4891: URL: https://github.com/apache/hudi/issues/4891#issuecomment-1073287297
[FelixKJose](https://github.com/FelixKJose)

> 1. Let's say each of my partitions (date) is a large partition (e.g. 6.5 TB uncompressed data), so having frequent async clustering is suggested, right? I am running on r5.4xlarge (meaning 37 GB driver memory), so what will be the best clustering frequency?

You can start with one Spark job per partition (so it creates one replacecommit for one sorting operation on a partition) and keep increasing the number of partitions clustered in a single job to find the breaking point. I think with the above driver memory it can easily handle 4 partitions. You will need to experiment with your data to figure out how much parallelism you can afford; locking, archival, and other table services will become the bottleneck when you run clustering with very high parallelism.

**Note:** Set `"hoodie.clustering.async.max.commits"` to `"0"`; that way multiple clustering plans can be generated in parallel. Since the clustering jobs run on different partitions, you should be OK.

> 2. What will be the best value for hoodie.clustering.plan.strategy.small.file.limit? Also any other configurations I should be using considering the partition size as mentioned above

Since you are using the `"hoodie.clustering.plan.strategy.sort.columns"` config, I am assuming you want to sort the partitions. The main objective of a sorting operation is to sort the data on the given columns and create a new set of parquet files whose sizes are close to the value of `hoodie.clustering.plan.strategy.target.file.max.bytes`. So you should not worry about small.file.limit: a sorting operation rewrites the entire partition anyway and creates larger parquet files. I would suggest keeping the small.file.limit value high so that all files are included. `hoodie.clustering.plan.strategy.small.file.limit` is mainly useful for the stitching operation, where you are not sorting data but stitching small files together; there you can reduce the small file limit.

> 3.
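As a sketch of the advice above (the config keys come from the discussion; the sort columns `ts,key` and the exact byte sizes are illustrative assumptions, not recommendations), the clustering settings might be passed as Hudi write options like this:

```python
# Hypothetical Hudi write options for async, sort-based clustering.
# The sort columns and size values below are placeholders; tune them
# for your own table.
clustering_opts = {
    # Run clustering asynchronously alongside ingestion
    "hoodie.clustering.async.enabled": "true",
    # 0 => don't wait for N commits between plans, so multiple
    # clustering plans can be generated in parallel
    "hoodie.clustering.async.max.commits": "0",
    # Columns to sort by within each clustering group (placeholder names)
    "hoodie.clustering.plan.strategy.sort.columns": "ts,key",
    # Target size (~1 GB here) for the rewritten parquet files
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    # Keep this high so every file in the partition is picked up for sorting
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * 1024 * 1024),
}
```

These would then be merged into the writer's options, e.g. `df.write.format("hudi").options(**clustering_opts)`, alongside your usual table configs.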
> Which lock provider is advised if I am running on AWS EMR?

I do not have much knowledge of the AWS stack; the default lock provider, i.e. ZookeeperBasedLockProvider, works just fine.
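For completeness, a minimal sketch of wiring up the ZookeeperBasedLockProvider mentioned above (the ZooKeeper host, port, lock key, and base path are placeholders you would replace with your own environment's values):

```python
# Hypothetical lock-provider options for running concurrent writers
# (e.g. ingestion plus async clustering). All connection values are
# placeholders, not real endpoints.
lock_opts = {
    # Required when multiple writers/table services run concurrently
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # The default ZooKeeper-based lock provider discussed above
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",        # placeholder host
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "my_table",  # placeholder key
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}
```

As with the clustering options, these are merged into the Hudi writer options of every concurrent job so they all contend on the same lock.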
