codope commented on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1058149325


   > How does normally people perform inline clustering or async clustering on 
partitions with large amount of data? Do you expect driver memory should be 
larger than the clustering data size?
   
   Currently, [the 
union](https://github.com/apache/hudi/blob/a4ba0fff07861bdff338074fb5d84726f430883f/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java#L108)
 will collect hudi records (as per the plan generated for corresponding 
replacecommit) into the driver.
   
   > What are the configurations I should be using to perform clustering on 
these large tables?
   > In PROD I will have 1.8 billion records (each record 3-4 KB in memory) on 
each day, so is it advised to perform clustering frequently (every 10 to 20 
commits) or daily?
   
   As you said, you could run clustering more frequently. Configs will also 
depend on clustering strategy. Looking at your configs, you can adjust the 
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to a lower 
value. You could also tune `hoodie.clustering.plan.strategy.small.file.limit` 
to a value such that even though each round of clustering is picking just a few 
partitions but it is able to utilize the memory as much as possible. What's the 
nature of your workload? Is it like append-only, or frequent upserts but to 
more recent partitions? 
   
   > Does MOR table supports async clustering with OCC assurance?
   
   Yes, if the lock provider is configured.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to