TheR1sing3un opened a new pull request, #13351:
URL: https://github.com/apache/hudi/pull/13351

   Our current logic disguises all the files under each partition as a single 
`PartitionDirectory`. To let each task know which files it actually needs to 
read, we also put the collection of all the `FileSlice`s under that partition 
into the `PartitionValue`. When each subsequent task executes, it can then look 
up the file slice it should read from the file-slice mapping stored in the 
`PartitionValue`.
   However, I found that as the number of files in a partition grows, for 
example to tens of thousands of files, the file slices in the `PartitionValue` 
can exceed 100MB in size. When Spark creates read tasks, it passes this 
`FileSlice` mapping to every task, so under our default configuration the job 
fails. Moreover, each task only cares about the `FileSlice` it needs to read; 
there is no need to pass all the `FileSlice`s of the partition to it.
   Therefore, I optimized the above logic: only the `FileSlice` that each read 
task actually needs is passed to it, which removes the unnecessary broadcast 
overhead during task creation.
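   To illustrate the idea in spirit (these are hypothetical, simplified classes, 
not Hudi's actual `FileSlice` or its real task-planning API), the difference 
between attaching the whole per-partition slice map to every task and attaching 
a single slice per task can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

public class SlicePlanning {
    // Simplified stand-in for a file slice: one base file plus its log files.
    record FileSlice(String baseFile, List<String> logFiles) {}

    // Before: every task's partition value carried ALL slices of the partition,
    // so a partition with tens of thousands of slices was serialized per task.
    record PartitionValueBefore(String partitionPath, List<FileSlice> allSlices) {}

    // After: each read task carries exactly the one slice it will scan.
    record ReadTask(String partitionPath, FileSlice slice) {}

    // Plan one task per slice; each task's payload is O(1) in the number of
    // slices in the partition, instead of O(n) as before.
    static List<ReadTask> planTasks(String partitionPath, List<FileSlice> slices) {
        List<ReadTask> tasks = new ArrayList<>();
        for (FileSlice s : slices) {
            tasks.add(new ReadTask(partitionPath, s));
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<FileSlice> slices = List.of(
            new FileSlice("base_1.parquet", List.of("log_1")),
            new FileSlice("base_2.parquet", List.of("log_2")));
        List<ReadTask> tasks = planTasks("dt=2024-01-01", slices);
        System.out.println(tasks.size()); // 2
    }
}
```

   With this shape, the serialized size of a task no longer depends on how many 
files the partition contains.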
   
   ### Change Logs
   
   1. Avoid broadcasting unnecessary `FileSlice` when reading
   
   ### Impact
   
   Improves query stability for partitions containing very large numbers of files.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]