JeremyXin opened a new pull request, #8453: URL: https://github.com/apache/seatunnel/pull/8453
<!--
Thank you for contributing to SeaTunnel! Please make sure that your code changes are covered with tests. And in case of new features or big changes remember to adjust the documentation.

Feel free to ping committers for the review!

## Contribution Checklist

- Make sure that the pull request corresponds to a [GITHUB issue](https://github.com/apache/seatunnel/issues).
- Name the pull request in the form "[Feature] [component] Title of the pull request", where *Feature* can be replaced by `Hotfix`, `Bug`, etc.
- Minor fixes should be named following this pattern: `[hotfix] [docs] Fix typo in README.md doc`.
-->

### Purpose of this pull request

<!-- Describe the purpose of this pull request. For example: This pull request adds checkstyle plugin.-->

This pull request addresses issue #8451.

To address that issue, this change allocates files to subtasks with a polling (round-robin) algorithm instead of the current hash-based allocation, which is effectively random. Polling keeps the allocation balanced across subtasks and thereby improves performance.

To compare the two, I used SeaTunnel to synchronize HDFS files with the parallelism set to 10 and five files in the source path. The following screenshots show the file allocation results of the original hash-based algorithm in the source code and of the improved polling algorithm:

Original allocation based on file hashing: even though the parallelism is greater than the number of files, a single subtask has to process multiple files.

Optimized allocation based on polling: when the parallelism is greater than the number of files, each subtask processes at most one file.
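The difference between the two strategies can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code in this pull request: the class `SplitAssignmentSketch`, its method names, and the example file paths are hypothetical, and the hash rule shown (non-negative `hashCode` modulo the parallelism) is an assumed stand-in for the existing hash-based owner selection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: compares hash-based and round-robin file assignment.
public class SplitAssignmentSketch {

    // Assumed hash rule: the owner is derived from the path hash alone,
    // so two files can land on the same subtask by coincidence.
    static int hashOwner(String path, int parallelism) {
        return (path.hashCode() & Integer.MAX_VALUE) % parallelism;
    }

    // One empty file list per subtask.
    static List<List<String>> emptyAssignment(int parallelism) {
        List<List<String>> assignment = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            assignment.add(new ArrayList<>());
        }
        return assignment;
    }

    static List<List<String>> assignByHash(List<String> files, int parallelism) {
        List<List<String>> assignment = emptyAssignment(parallelism);
        for (String file : files) {
            assignment.get(hashOwner(file, parallelism)).add(file);
        }
        return assignment;
    }

    // Round-robin (polling): the i-th file goes to subtask i % parallelism,
    // so subtask loads never differ by more than one file.
    static List<List<String>> assignRoundRobin(List<String> files, int parallelism) {
        List<List<String>> assignment = emptyAssignment(parallelism);
        for (int i = 0; i < files.size(); i++) {
            assignment.get(i % parallelism).add(files.get(i));
        }
        return assignment;
    }

    // Largest number of files assigned to any single subtask.
    static int maxLoad(List<List<String>> assignment) {
        int max = 0;
        for (List<String> files : assignment) {
            max = Math.max(max, files.size());
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical paths: five files and parallelism 10, as in the scenario above.
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            files.add("hdfs://example/path/file-" + i + ".txt");
        }
        int parallelism = 10;
        System.out.println("hash max load: " + maxLoad(assignByHash(files, parallelism)));
        System.out.println("round-robin max load: " + maxLoad(assignRoundRobin(files, parallelism)));
    }
}
```

With round-robin the maximum load is the file count divided by the parallelism, rounded up, whereas the hash-based result depends on how the individual paths happen to hash.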
The following task runtime information shows the processing performance of the original file allocation algorithm and the polling file allocation algorithm:

With the original file allocation algorithm, the task's processing rate is 4520 per second and the total task time is 929 seconds.

With the polling file allocation algorithm, the task's processing rate is 10719 per second and the total task time is 518 seconds.

In summary, the polling-based file allocation algorithm balances file allocation across subtasks and improved task processing performance in this test, so it is an optimization worth considering.

### Does this PR introduce _any_ user-facing change?

<!--
Note that it means *any* user-facing change including all aspects such as the documentation fix.
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change compared to the released SeaTunnel versions or within the unreleased branches such as dev.
If no, write 'No'.
If you are adding/modifying connector documents, please follow our new specifications: https://github.com/apache/seatunnel/issues/4544.
-->

### How was this patch tested?

<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
If you are adding E2E test cases, maybe refer to https://github.com/apache/seatunnel/blob/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-cdc-mysql-e2e/src/test/resources/mysqlcdc_to_mysql.conf, here is a good example.
-->

The comparison above comes from a task in which I used SeaTunnel to synchronize files from an external HDFS to a local HDFS. In this scenario, I set the task parallelism to 10, the source to HdfsFile, and the sink to HdfsFile, with five files in the upstream HDFS path. The performance of the two file allocation algorithms was compared by running the actual synchronization task.

My unit tests are in the FileSourceSplitEnumeratorTest class.

If you have any questions, please contact me. Thanks.

### Check list

* [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
* [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
* [ ] If you are contributing the connector code, please check that the following files are updated:
    1. Update [plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties) and add new connector information in it
    2. Update the pom file of [seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
    3. Add ci label in [label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
    4. Add e2e testcase in [seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
    5. Update connector [plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)
* [ ] Update the [`release-note`](https://github.com/apache/seatunnel/blob/dev/release-note.md).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@seatunnel.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org