JeremyXin opened a new pull request, #8453:
URL: https://github.com/apache/seatunnel/pull/8453

   
   ### Purpose of this pull request
   
   This pull request addresses issue #8451.
   
   To address the problem described there, this PR allocates files to subtasks with a polling (round-robin) algorithm instead of the current allocation based on file hash, which is effectively random. This balances the load across subtasks and improves performance. To compare the two, I used SeaTunnel to synchronize HDFS files with the parallelism set to 10 and five files in the source path. The following screenshots show the file allocation results of the original hash-based algorithm and of the improved polling algorithm:
   
   
![File allocation based on file hash](https://github.com/user-attachments/assets/93d52b77-d1bc-4c5b-82c3-79ec2d1be2b0)
   With the original allocation algorithm based on file hash, a SubTask may have to process multiple files even when the degree of parallelism is greater than the number of files.
   
   
![File allocation based on the polling algorithm](https://github.com/user-attachments/assets/b1e22635-4243-4828-9baa-e29e4e5180e8)
   With the optimized allocation algorithm based on polling, each SubTask processes at most one file when the degree of parallelism is greater than the number of files.
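
   The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `AllocationSketch`, its method names, and the file paths are all hypothetical, and the hash formula is only an assumption about how a hash-based owner is typically computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names and formulas are assumptions, not SeaTunnel's API.
class AllocationSketch {

    // Hash-based owner: the reader index depends only on the path's hash,
    // so several paths can land on the same reader while others stay idle.
    static int hashOwner(String filePath, int numReaders) {
        return (filePath.hashCode() & Integer.MAX_VALUE) % numReaders;
    }

    // Groups files by the reader chosen from each file's hash.
    static Map<Integer, List<String>> assignByHash(List<String> files, int numReaders) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (String file : files) {
            result.computeIfAbsent(hashOwner(file, numReaders), k -> new ArrayList<>())
                    .add(file);
        }
        return result;
    }

    // Polling (round-robin): the i-th file goes to reader i % numReaders,
    // so per-reader file counts differ by at most one.
    static Map<Integer, List<String>> assignRoundRobin(List<String> files, int numReaders) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (int i = 0; i < files.size(); i++) {
            result.computeIfAbsent(i % numReaders, k -> new ArrayList<>())
                    .add(files.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Five made-up paths and ten readers, mirroring the scenario above.
        List<String> files = Arrays.asList(
                "/data/in/part-0.txt", "/data/in/part-1.txt", "/data/in/part-2.txt",
                "/data/in/part-3.txt", "/data/in/part-4.txt");
        System.out.println("hash:        " + assignByHash(files, 10));
        System.out.println("round-robin: " + assignRoundRobin(files, 10));
    }
}
```

   With five files and ten readers, the round-robin variant always gives each file its own reader. The hash variant may or may not, depending on the hash values of the paths.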
   
   Next, the processing performance of the two allocation algorithms is compared. The following task runtime information shows the performance of the original file allocation algorithm and of the polling file allocation algorithm:
   
   
![Processing results with file-hash-based allocation (stitched screenshot)](https://github.com/user-attachments/assets/c45db393-2992-42c8-9234-210c5e93cfdd)
   With the original file allocation algorithm, the task throughput is 4520 per second and the total task time is 929 seconds.
   
   
![Processing results with polling-based allocation (stitched screenshot)](https://github.com/user-attachments/assets/19ca8bd7-844a-4454-a56d-825ee9b267b2)
   With the polling file allocation algorithm, the task throughput is 10719 per second and the total task time is 518 seconds.
   
   In summary, the polling-based file allocation algorithm distributes files across subtasks more evenly and improves task processing performance. In this test, throughput rose from 4520 to 10719 per second (about 2.4x) and total task time fell from 929 to 518 seconds (about 44% shorter), so this looks like an optimization worth considering.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   The comparison above comes from a real synchronization task in which I used SeaTunnel to copy files from an external HDFS cluster to a local HDFS cluster. The task parallelism was 10, the source and the sink were both HdfsFile, and the upstream HDFS path contained five files. The two file allocation algorithms were compared by running this task with each of them.
   
   The unit tests are in the `FileSourceSplitEnumeratorTest` class.
   
   If you have any questions, please let me know. Thanks.
   
   ### Check list
   
   * [ ] If your PR adds any new Jar binary package, please add a License Notice according to the [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [ ] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If you are contributing the connector code, please check that the 
following files are updated:
     1. Update 
[plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
 and add new connector information in it
     2. Update the pom file of 
[seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
     3. Add ci label in 
[label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
     4. Add e2e testcase in 
[seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
     5. Update connector 
[plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)
   * [ ] Update the 
[`release-note`](https://github.com/apache/seatunnel/blob/dev/release-note.md).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
