Rap70r opened a new issue, #5770: URL: https://github.com/apache/hudi/issues/5770
Hello,

We are using Apache Hudi to apply COPY_ON_WRITE incremental updates to Parquet files on S3. We have set the property `hoodie.parquet.max.file.size` to 18874368 (18 MB). On the initial sync, the Hudi table is split into multiple Parquet files of the size set by this property, under the partition folders. However, after several incremental updates, the files get merged into one large file that exceeds 100 MB.

Here are the configs we are using:

```
parallelism: 200
operation: upsert
storageType: COPY_ON_WRITE
maxVersions: 1
hoodie.datasource.hive_sync.enable: false
hoodie.finalize.write.parallelism: 200
hoodie.parquet.max.file.size: 18874368
```

**Expected behavior**

The Hudi table should maintain Parquet files of the size defined by the `hoodie.parquet.max.file.size` property. Are we missing a specific property that needs to be configured in order to maintain those file sizes?

**Environment Description**

* Hudi version : 0.11.0
* Spark version : 3.1.2
* Hive version : None
* Hadoop version : 3.2.1
* Storage : S3
* Running on Docker? : No

Thank you
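For context on the question above: in Hudi, the target max file size is not the only property involved in file sizing; during upserts, files below `hoodie.parquet.small.file.limit` are treated as "small" and receive new records until they grow past that limit. A minimal sketch of the sizing-related properties, assuming the small-file limit is the relevant factor here (the values below are illustrative assumptions, not a confirmed fix for this report):

```
# Illustrative sketch only -- values are assumptions, not a verified fix.
hoodie.parquet.max.file.size: 18874368        # target max size per Parquet file (18 MB)
hoodie.parquet.small.file.limit: 9437184      # files under this size are padded with new inserts
hoodie.copyonwrite.record.size.estimate: 1024 # avg record size Hudi uses to size new file groups
```

Note that if `hoodie.parquet.small.file.limit` is left larger than `hoodie.parquet.max.file.size`, every file stays a "small file" candidate and can keep growing past the configured max on each upsert.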
