suryaprasanna commented on issue #4891: URL: https://github.com/apache/hudi/issues/4891#issuecomment-1073287297
[FelixKJose](https://github.com/FelixKJose)

> 1. Let's say each of my partitions (date) is a large partition (e.g. 6.5 TB uncompressed data), so having frequent async clustering is suggested, right? I am running on r5.4xlarge (meaning 37 GB driver memory), so what will be the best clustering frequency?

You can start with one Spark job per partition (so it creates one replacecommit for one sorting operation on a partition) and keep increasing the number of partitions clustered in a single job to find the breaking point. I think with the above driver memory it can easily handle 4 partitions. You will need to experiment with your data to figure out how much parallelism you can afford; locking, archival, and other table services will become the bottleneck when you run clustering with very high parallelism.

**Note:** Set `"hoodie.clustering.async.max.commits"` to `"0"`; that way multiple clustering plans can be generated in parallel. Since the clustering jobs run on different partitions, you should be OK.

> 2. What will be the best value for hoodie.clustering.plan.strategy.small.file.limit? Also any other configurations I should be using considering the partition size as mentioned above

Since you are using the `"hoodie.clustering.plan.strategy.sort.columns"` config, I am assuming you want to sort the partitions. The main objective of a sorting operation is to sort the data on the given columns and create a new set of parquet files whose sizes are close to the value of `hoodie.clustering.plan.strategy.target.file.max.bytes`. So you should not worry about small.file.limit: a sorting operation rewrites the entire partition anyway and creates larger parquet files. I would suggest keeping the small.file.limit value high so that all files are included. `hoodie.clustering.plan.strategy.small.file.limit` is mainly useful for the stitching operation, where you are not sorting data but stitching small files together; there you can reduce the small file limit.

> 3.
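As a sketch of the advice above (the config keys come from the discussion; the sort columns `ts,key` and the exact byte sizes are illustrative assumptions, not recommendations), the clustering settings might be passed as Hudi write options like this:

```python
# Hypothetical Hudi write options for async, sort-based clustering.
# The sort columns and size values below are placeholders; tune them
# for your own table.
clustering_opts = {
    # Run clustering asynchronously alongside ingestion
    "hoodie.clustering.async.enabled": "true",
    # 0 => don't wait for N commits between plans, so multiple
    # clustering plans can be generated in parallel
    "hoodie.clustering.async.max.commits": "0",
    # Columns to sort by within each clustering group (placeholder names)
    "hoodie.clustering.plan.strategy.sort.columns": "ts,key",
    # Target size (~1 GB here) for the rewritten parquet files
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    # Keep this high so every file in the partition is picked up for sorting
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * 1024 * 1024),
}
```

These would then be merged into the writer's options, e.g. `df.write.format("hudi").options(**clustering_opts)`, alongside your usual table configs.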
> Which lock provider is advised if I am running on AWS EMR?

I do not have much knowledge of the AWS stack; the default lock provider, i.e. ZookeeperBasedLockProvider, works just fine.
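For completeness, a minimal sketch of wiring up the ZookeeperBasedLockProvider mentioned above (the ZooKeeper host, port, lock key, and base path are placeholders you would replace with your own environment's values):

```python
# Hypothetical lock-provider options for running concurrent writers
# (e.g. ingestion plus async clustering). All connection values are
# placeholders, not real endpoints.
lock_opts = {
    # Required when multiple writers/table services run concurrently
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # The default ZooKeeper-based lock provider discussed above
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",        # placeholder host
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "my_table",  # placeholder key
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}
```

As with the clustering options, these are merged into the Hudi writer options of every concurrent job so they all contend on the same lock.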
