TheR1sing3un opened a new pull request, #13351: URL: https://github.com/apache/hudi/pull/13351
Our current logic groups all files under each partition into a single `PartitionDirectory`. So that each task can identify the files it actually needs to read, we also store the collection of all `FileSlice`s for the partition in the `PartitionValue`; when each task executes, it looks up the file slice it should read from that mapping. However, I found that when the number of files in a partition grows large, for example tens of thousands of files, the file slices stored in the `PartitionValue` can exceed 100 MB. Since Spark must ship this full `FileSlice` mapping to every read task it creates, the job fails under our default configuration. Moreover, each task only cares about the `FileSlice` it needs to read; there is no need to pass it every `FileSlice` in the partition. I therefore optimized the above logic: each read task is now passed only the `FileSlice` it actually reads, eliminating the redundant broadcast overhead at task creation.

### Change Logs

1. Avoid broadcasting unnecessary `FileSlice`s when reading

### Impact

Improves query stability.

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
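To make the scaling problem concrete, here is a minimal, self-contained Java sketch (the class and field names are hypothetical, not Hudi's actual API): before the change, every read task's payload duplicates the partition's full slice list, so the total bytes shipped grow quadratically with the file count; after the change, each task carries only its own slice, so the total grows linearly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative types only; not Hudi's real classes.
public class SlicePlanning {
    record FileSlice(String fileId, long sizeBytes) {}

    // Before: each of the N tasks carries all N slices of the partition,
    // so the total serialized payload is N * (sum of slice sizes).
    static long payloadBefore(List<FileSlice> partitionSlices) {
        long total = 0;
        for (FileSlice ignored : partitionSlices) {
            for (FileSlice s : partitionSlices) {
                total += s.sizeBytes();
            }
        }
        return total;
    }

    // After: each task carries only the single slice it will read,
    // so the total payload is just the sum of slice sizes.
    static long payloadAfter(List<FileSlice> partitionSlices) {
        long total = 0;
        for (FileSlice s : partitionSlices) {
            total += s.sizeBytes();
        }
        return total;
    }

    public static void main(String[] args) {
        List<FileSlice> slices = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            slices.add(new FileSlice("f" + i, 10_000));
        }
        System.out.println("before: " + payloadBefore(slices)); // quadratic in file count
        System.out.println("after:  " + payloadAfter(slices));  // linear in file count
    }
}
```

With 10,000 files the per-partition duplication multiplies the payload by the task count, which is exactly why a 100 MB+ `PartitionValue` shipped to every task can sink the job under default Spark memory and RPC limits.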
