srsteinmetz commented on issue #1830: URL: https://github.com/apache/hudi/issues/1830#issuecomment-659522682
I just started a new test and will watch for increasing processing times, then post the associated .commit file.

My more recent tests use 36 partitions. Inspecting them, each partition generally has ~1000 .log files at ~3.8 MB each and ~700 .parquet files at ~120 MB each. Each file slice seems to retain 8 versions, so each partition has roughly 88 unique file slices. I have inline compaction enabled, but the load generator is currently sending updates at 10k TPS, which probably accounts for the large number of .log files.

The table we are attempting to model has 150 TB of storage in DynamoDB. Is there a general rule of thumb for the number of partitions and the ideal number of parquet files per partition? We may be able to work backwards from the size of the source table to determine the proper number of partitions and maxFileSize.
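To make the "work backwards" idea concrete, here is a rough back-of-the-envelope sketch (not anything Hudi provides; the function names, the 120 MB target, and the 700-files-per-partition budget are illustrative assumptions) that estimates parquet files per partition for a given partition count, and conversely how many partitions would be needed to stay under a per-partition file budget:

```python
import math

def files_per_partition(table_size_gb, num_partitions, target_file_size_mb=120):
    """Rough estimate of parquet base files per partition, assuming an
    even spread of data and one active version per file slice."""
    size_per_partition_mb = table_size_gb * 1024 / num_partitions
    return round(size_per_partition_mb / target_file_size_mb)

def partitions_needed(table_size_gb, target_file_size_mb, max_files_per_partition):
    """Minimum partition count to keep each partition under a file budget."""
    total_files = table_size_gb * 1024 / target_file_size_mb
    return math.ceil(total_files / max_files_per_partition)

# 150 TB source table, 36 partitions, 120 MB target parquet size
print(files_per_partition(150 * 1024, 36, 120))        # ~36409 files/partition

# How many partitions to stay around 700 parquet files each?
print(partitions_needed(150 * 1024, 120, 700))         # ~1873 partitions
```

If these assumptions are roughly right, 36 partitions for 150 TB implies tens of thousands of base files per partition at 120 MB each, which suggests either far more partitions or a much larger maxFileSize would be needed.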
