cshuo commented on PR #13409: URL: https://github.com/apache/hudi/pull/13409#issuecomment-3095627234
> Small files is not good for query performance.

As mentioned [above](https://github.com/apache/hudi/pull/13409#discussion_r2149311677), we can trigger flushing by buffer memory size and set that size appropriately to relieve the small-files pressure. Also, the current impl doesn't seem to ensure the data is ordered at the row-group level either, since the row group is switched once it reaches the configured size limit, e.g. the current default of 120 MB (`HoodieStorageConfig#PARQUET_BLOCK_SIZE`).

> But if we have whole parquet file with order, we will lose the data freshness.

Actually, data freshness is determined by the checkpoint interval: the writer flushes and commits the written files during checkpoint, and until that point the data remains invisible anyway.

> Sort time will increase a lot then cause the high back pressure in Flink job.

Agreed that keeping the whole file ordered takes more sort time, but I'm not sure how significant the impact is. I remember @Alowator has an ingestion benchmark that includes sorting a binary buffer [here](https://github.com/apache/hudi/pull/12729#issue-2817564527), and reported that `sort performs fast enough so it doesn't affect write performance`, with a default batch size of 256 MB to trigger flushing. Maybe you can double-check that. cc @HuangZhenQiu
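A minimal sketch of the size-triggered sort-then-flush idea discussed above (class and method names are hypothetical, not Hudi's actual writer API): records accumulate in a buffer, and once the buffered size crosses a configurable threshold (analogous to the 256 MB batch size mentioned), the whole batch is sorted once and then flushed, so each flushed file is fully ordered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Hudi's real writer implementation.
public class SortedBufferFlusher {
    private final long flushThresholdBytes;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    // Stands in for files written to storage; each inner list is one flushed batch.
    private final List<List<String>> flushedBatches = new ArrayList<>();

    SortedBufferFlusher(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    void write(String record) {
        buffer.add(record);
        bufferedBytes += record.length(); // rough in-memory size estimate
        if (bufferedBytes >= flushThresholdBytes) {
            flush();
        }
    }

    // Sort once per batch: the O(n log n) cost is paid at flush time,
    // not per record, which is why a large batch can still be cheap to sort.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(buffer);
        Collections.sort(batch); // whole flushed file is ordered
        flushedBatches.add(batch);
        buffer.clear();
        bufferedBytes = 0;
    }

    List<List<String>> batches() {
        return flushedBatches;
    }

    public static void main(String[] args) {
        // Tiny threshold so the size-triggered flush fires in this demo.
        SortedBufferFlusher w = new SortedBufferFlusher(4);
        for (String r : new String[] {"d", "b", "a", "c", "f", "e"}) {
            w.write(r);
        }
        w.flush(); // final flush, as a checkpoint would trigger
        System.out.println(w.batches());
    }
}
```

Note that each batch is ordered internally but batches are not ordered relative to each other, which mirrors the trade-off in the thread: per-file ordering without holding all data until checkpoint.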
