TheR1sing3un opened a new pull request, #13070: URL: https://github.com/apache/hudi/pull/13070
Closes: https://github.com/apache/hudi/issues/12139

When dealing with `HadoopFsRelation`, Spark merges `PartitionedFile`s into read tasks based on data such as file size. At present, we directly use the base file or an arbitrary log file as the `PartitionedFile` representing a `FileSlice`, so Spark cannot rely on representative sizes when merging. This PR therefore estimates the size the entire `FileSlice` would have if converted into a Parquet file, and uses that estimate to represent the file slice, giving Spark more accurate data for its optimization.

### Change Logs

1. Introduce a representative file containing the estimated total size of the file slice.

### Impact

Reduces task skew when reading.

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
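The estimation idea above can be sketched roughly as follows. This is a minimal illustrative sketch, not the PR's actual implementation: the class name, method signature, and the `LOG_TO_PARQUET_RATIO` constant are all hypothetical assumptions, standing in for whatever ratio or statistics Hudi derives from table metadata.

```java
import java.util.List;

public class FileSliceSizeEstimator {
    // Hypothetical compression ratio: bytes in log files (e.g. Avro blocks)
    // usually shrink when rewritten as columnar Parquet. The real value would
    // come from table statistics, not a hard-coded constant.
    private static final double LOG_TO_PARQUET_RATIO = 0.35;

    /**
     * Estimate the size the whole file slice would occupy as one Parquet file,
     * so a representative PartitionedFile can carry this size to Spark's
     * file-merging (partition coalescing) logic.
     *
     * @param baseFileSize size in bytes of the Parquet base file, or 0 if absent
     * @param logFileSizes sizes in bytes of each log file in the slice
     * @return estimated total Parquet-equivalent size in bytes
     */
    public static long estimateParquetSize(long baseFileSize, List<Long> logFileSizes) {
        long logBytes = logFileSizes.stream().mapToLong(Long::longValue).sum();
        // Base file is already Parquet; only log bytes are scaled by the ratio.
        return baseFileSize + Math.round(logBytes * LOG_TO_PARQUET_RATIO);
    }

    public static void main(String[] args) {
        // e.g. a 100-byte base file plus 100 bytes of logs
        System.out.println(estimateParquetSize(100L, List.of(40L, 60L)));
    }
}
```

With a representative size like this, Spark can pack several small file slices into one read task and split work more evenly, instead of sizing tasks from a single base or log file.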