cshuo commented on PR #13409:
URL: https://github.com/apache/hudi/pull/13409#issuecomment-3095627234

   > Small files is not good for query performance.
   
   As mentioned
[above](https://github.com/apache/hudi/pull/13409#discussion_r2149311677), we
can trigger flushing by buffer memory size and set that size appropriately to
relieve the small-file pressure. Also, the current impl doesn't seem to
guarantee that data is ordered even at the row-group level, since a row group
is switched once it reaches the configured size limit, 120MB by default
(`HoodieStorageConfig#PARQUET_BLOCK_SIZE`).
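   To make the idea concrete, here is a minimal sketch of flushing by buffered
size with a sort before each flush, so every flushed batch is key-ordered no
matter where row-group boundaries fall. All names here (`SortedFlushBuffer`,
the threshold, the size estimate) are hypothetical illustrations, not actual
Hudi or Flink APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not a Hudi API: buffer incoming record keys and flush
// them sorted once the estimated buffered size crosses a threshold, so each
// flushed batch is ordered regardless of downstream row-group switching.
public class SortedFlushBuffer {
    private final long flushThresholdBytes;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushedBatches = new ArrayList<>();
    private long bufferedBytes = 0;

    public SortedFlushBuffer(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    public void add(String recordKey) {
        buffer.add(recordKey);
        bufferedBytes += recordKey.length(); // rough in-memory size estimate
        if (bufferedBytes >= flushThresholdBytes) {
            flush();
        }
    }

    // Sort the in-memory batch before handing it off, so ordering holds
    // within the flushed batch independently of row-group boundaries.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(buffer);
        batch.sort(String::compareTo);
        flushedBatches.add(batch);
        buffer.clear();
        bufferedBytes = 0;
    }

    public List<List<String>> getFlushedBatches() {
        return flushedBatches;
    }

    public static void main(String[] args) {
        SortedFlushBuffer buf = new SortedFlushBuffer(4);
        for (String key : new String[] {"d", "b", "a", "c", "f", "e"}) {
            buf.add(key);
        }
        buf.flush(); // final flush, e.g. at checkpoint
        System.out.println(buf.getFlushedBatches());
    }
}
```

   With the threshold raised to something like the 256MB batch size mentioned
below, each flush (and the row groups cut from it) stays ordered while the
small-file count is bounded by the flush frequency.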
   
   > But if we have whole parquet file with order, we will lose the data 
freshness.
   
   Actually, data freshness is decided by the checkpoint interval. The writer
flushes and commits the written files during a checkpoint, and until that
point the data remains invisible.
   
   > Sort time will increase a lot then cause the high back pressure in Flink 
job.
   
   Agreed that keeping the whole file ordered will need more sort time. I'm
not sure how significant the impact is; I remember @Alowator has an ingestion
benchmark that includes sorting a binary buffer
[here](https://github.com/apache/hudi/pull/12729#issue-2817564527), and noted
that `sort performs fast enough so it doesn't affect write performance`, with
a default batch size of 256MB to trigger flushing. Maybe you can double-check
that. cc @HuangZhenQiu


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]