[GitHub] [seatunnel] ddna1021 opened a new issue, #5026: [Feature][Spark Starter] make data block balance before importing data

via GitHub Tue, 04 Jul 2023 23:12:13 -0700


ddna1021 opened a new issue, #5026:
URL: https://github.com/apache/seatunnel/issues/5026


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   Data blocks in a file system such as HDFS often differ in size, and some 
blocks are twice as large as others. As a result, importing data with Apache 
SeaTunnel can produce serious data skew. The degree of parallelism can be 
specified, but it behaves like Spark's coalesce operator, which merges existing 
partitions without redistributing records, so it does not actually even out the 
data. Data skew greatly increases execution time, wastes computing resources, 
and can even cause the job to fail.
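
To make the skew argument concrete, here is a minimal sketch in plain Python (not SeaTunnel or Spark code). It assumes a simplified model in which a coalesce-style merge groups whole, contiguous blocks without moving individual records; the block sizes and the helper names `coalesce_sizes` and `skew_ratio` are hypothetical and chosen only for illustration.

```python
def coalesce_sizes(block_sizes, parallelism):
    """Merge contiguous blocks into `parallelism` groups without splitting any block."""
    groups = [0] * parallelism
    n = len(block_sizes)
    for i, size in enumerate(block_sizes):
        # Each block goes, whole, to one output group; no records are reshuffled.
        groups[i * parallelism // n] += size
    return groups


def skew_ratio(partition_sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly balanced."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean


# Hypothetical block sizes in MB: two blocks are twice the size of the rest.
blocks = [256, 256, 128, 128, 128, 128, 128, 128]
merged = coalesce_sizes(blocks, 4)
print(merged, skew_ratio(merged))
```

In this model the first merged partition ends up twice as large as the others, and since a stage finishes only when its slowest task does, the other tasks sit idle while the largest partition is processed.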
   
   There are 2 solutions:
   1. Run a task before importing data to make the data blocks uniform and 
suitably sized.
   2. In SeaTunnel, repartition the imported dataset first: modify the 
SinkExecuteProcessor class to repartition according to the configured 
parallelism.
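
The second solution can be sketched as follows. This is a plain-Python illustration of the idea, not the actual SinkExecuteProcessor change: the functions `repartition_round_robin` and `write_balanced` and the `sink` callback are hypothetical names. In Spark itself the redistribution would presumably be done with `Dataset.repartition(numPartitions)`, which shuffles records across the requested number of partitions.

```python
from itertools import chain


def repartition_round_robin(partitions, parallelism):
    """Redistribute all records evenly across `parallelism` output partitions."""
    out = [[] for _ in range(parallelism)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % parallelism].append(record)
    return out


def write_balanced(partitions, parallelism, sink):
    """Hypothetical pre-sink step: rebalance first, then hand each partition to the sink."""
    balanced = repartition_round_robin(partitions, parallelism)
    for part in balanced:
        sink(part)
    return balanced


# Skewed input: one partition holds most of the records.
skewed = [list(range(0, 70)), list(range(70, 85)), list(range(85, 100))]
written = []
balanced = write_balanced(skewed, 4, written.append)
print([len(p) for p in balanced])
```

With round-robin assignment the output partition sizes differ by at most one record, regardless of how uneven the input was. The cost is that every record is moved once, which corresponds to the shuffle that a repartition implies.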
   
   After testing, both methods are equally effective. The first method incurs 
unnecessary I/O overhead, so the second method is the better choice in terms of 
both effect and cost.
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

