ddna1021 opened a new issue, #5026: URL: https://github.com/apache/seatunnel/issues/5026
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

Data blocks in a file system such as HDFS often differ in size; some blocks are twice as large as others. As a result, importing data with Apache SeaTunnel can produce serious data skew. Although the degree of parallelism can be specified, it behaves like Spark's `coalesce` operator and does not actually make the data uniform.

Data skew greatly increases execution time, wastes computing resources, and can even cause the job to fail.

There are two solutions:

1. Run a task before importing the data to make the data blocks a uniform, suitable size.
2. In SeaTunnel, repartition the imported dataset first: modify the `SinkExecuteProcessor` class so that it repartitions according to the configured parallelism.

In testing, both methods were equally effective. The first method incurs unnecessary I/O overhead, so the second is the better choice in terms of both effect and cost.

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@seatunnel.apache.org.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
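
The following toy model illustrates the distinction the Description draws between coalesce-style parallelism and a true repartition. It is a sketch in plain Python, not SeaTunnel or Spark code: the functions `coalesce`, `repartition`, and `skew`, and the block sizes, are hypothetical and chosen only for illustration. It assumes that a coalesce-style merge combines whole input blocks without splitting them, while a repartition redistributes individual records.

```python
from itertools import chain


def coalesce(partitions, n):
    """Merge whole partitions into n groups without splitting any of them.

    Toy stand-in for a coalesce-style operation: no record leaves the
    block it arrived in, so uneven blocks produce uneven groups.
    """
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)
    return groups


def repartition(partitions, n):
    """Redistribute individual records round-robin across n partitions.

    Toy stand-in for a full shuffle: every record is reassigned, so the
    output sizes differ by at most one record.
    """
    groups = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        groups[i % n].append(record)
    return groups


def skew(partitions):
    """Ratio of the largest partition size to the mean size (1.0 = uniform)."""
    sizes = [len(p) for p in partitions]
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical input: four blocks, two of them twice as large as the others.
blocks = [list(range(200)), list(range(100)), list(range(200)), list(range(100))]
parallelism = 2  # assumed configured parallelism

merged = coalesce(blocks, parallelism)
shuffled = repartition(blocks, parallelism)

print([len(p) for p in merged], skew(merged))      # sizes [400, 200]
print([len(p) for p in shuffled], skew(shuffled))  # sizes [300, 300]
```

In this model the coalesce-style merge leaves one task with twice the data of the other, while the repartition gives both tasks the same amount. The trade-off is that a real repartition moves every record through a shuffle.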