TheR1sing3un opened a new pull request, #13070: URL: https://github.com/apache/hudi/pull/13070
Closes: https://github.com/apache/hudi/issues/12139

When dealing with `HadoopFsRelation`, Spark merges `PartitionedFile`s into read tasks based on data such as file size. At present, we directly use the base file or an arbitrary log file as the `PartitionedFile` representing a `FileSlice`, so Spark cannot rely on representative sizes when merging. This PR therefore estimates the size the entire `FileSlice` would have if converted into a Parquet file, and uses that estimate to represent the file slice, giving Spark more accurate data for its optimization.

### Change Logs

1. Introduce a representative file containing the estimated total size of the file slice.

### Impact

Reduces task skew when reading.

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
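The estimation idea above can be sketched roughly as follows. This is a minimal illustrative sketch, not the PR's actual implementation: the class name, method signature, and the `LOG_TO_PARQUET_RATIO` constant are all hypothetical assumptions, standing in for whatever ratio or statistics Hudi derives from table metadata.

```java
import java.util.List;

public class FileSliceSizeEstimator {
    // Hypothetical compression ratio: bytes in log files (e.g. Avro blocks)
    // usually shrink when rewritten as columnar Parquet. The real value would
    // come from table statistics, not a hard-coded constant.
    private static final double LOG_TO_PARQUET_RATIO = 0.35;

    /**
     * Estimate the size the whole file slice would occupy as one Parquet file,
     * so a representative PartitionedFile can carry this size to Spark's
     * file-merging (partition coalescing) logic.
     *
     * @param baseFileSize size in bytes of the Parquet base file, or 0 if absent
     * @param logFileSizes sizes in bytes of each log file in the slice
     * @return estimated total Parquet-equivalent size in bytes
     */
    public static long estimateParquetSize(long baseFileSize, List<Long> logFileSizes) {
        long logBytes = logFileSizes.stream().mapToLong(Long::longValue).sum();
        // Base file is already Parquet; only log bytes are scaled by the ratio.
        return baseFileSize + Math.round(logBytes * LOG_TO_PARQUET_RATIO);
    }

    public static void main(String[] args) {
        // e.g. a 100-byte base file plus 100 bytes of logs
        System.out.println(estimateParquetSize(100L, List.of(40L, 60L)));
    }
}
```

With a representative size like this, Spark can pack several small file slices into one read task and split work more evenly, instead of sizing tasks from a single base or log file.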