hailin0 opened a new issue, #8837:
URL: https://github.com/apache/seatunnel/issues/8837

   ### Search before asking
   
   - [x] I have searched the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 issues and found no similar feature request.
   
   
   ### Description
   
   Currently the file connector scans the files in a directory and treats each
   file as one split.
   
   However, a very large file is then read by a single reader, which makes
   reading slow. We can consider dividing a single large file into multiple
   splits so that it can be read in parallel.
   
   example:
   ```
   file_1.csv     100gb
   file_2.csv     100mb
   file_3.csv     100kb
   ```
   
   splits result:
   ```
   split_1<file_1.csv, startPos=0, endPos=104857600>
   split_2<file_1.csv, startPos=104857600, endPos=209715200>
   ....
   split_x<file_2.csv, startPos=0, endPos=104857600>
   split_y<file_3.csv, startPos=0, endPos=102400>
   ```
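
The byte ranges in the example can be derived from the file size alone. Below is a minimal sketch of that planning step, assuming a fixed split size of 100 MiB (104857600 bytes) as in the example. `FileSplit`, `plan_splits`, and `SPLIT_SIZE` are hypothetical names for illustration and are not part of the SeaTunnel API.

```python
from dataclasses import dataclass
from typing import Iterator

# Assumed split size: 100 MiB, matching the 104857600-byte ranges in the example.
SPLIT_SIZE = 100 * 1024 * 1024


@dataclass(frozen=True)
class FileSplit:
    """Hypothetical split descriptor: a byte range within one file."""
    path: str
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset


def plan_splits(path: str, file_size: int,
                split_size: int = SPLIT_SIZE) -> Iterator[FileSplit]:
    """Yield consecutive byte ranges that together cover [0, file_size)."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    start = 0
    while start < file_size:
        # The last range of a file is shortened to end at the file size.
        end = min(start + split_size, file_size)
        yield FileSplit(path, start, end)
        start = end
```

With these assumptions, a 100 GiB file yields 1024 splits, while a file no larger than the split size yields a single split covering the whole file.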
   
   Each split must read only complete data rows: a row that straddles a split
   boundary has to be read in full by exactly one split. The split layout
   above is only a reference and does not have to be followed exactly.
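
One common way to keep rows complete is to assign each row to the split that contains its first byte. The sketch below illustrates that convention for newline-delimited text; it is one possible approach, not the design this issue settles on. `read_rows` is a hypothetical helper, and the sketch assumes rows end with `\n` and never contain embedded newlines (quoted multi-line CSV fields would need extra handling).

```python
from typing import Iterator


def read_rows(path: str, start: int, end: int) -> Iterator[bytes]:
    """Yield every row whose first byte lies in the byte range [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard through the next newline.
            # If byte start-1 is a newline, a row begins exactly at `start`
            # and nothing is lost; otherwise the partial row is skipped,
            # because it belongs to the previous split.
            f.seek(start - 1)
            f.readline()
        # Keep reading while the next row starts inside this split. The last
        # row may extend past `end`; it is still read in full here.
        while f.tell() < end:
            line = f.readline()
            if not line:  # end of file
                break
            yield line
```

Under this convention every row is read exactly once regardless of where the boundaries fall, so the split size can be chosen independently of the row length.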
   
   
   Connectors list:
   
https://github.com/apache/seatunnel/tree/dev/seatunnel-connectors-v2/connector-file
   
   
   Updates:
   - update the file connectors
   - update the docs
   - add test cases
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


