Re: [PR] [Improve][Connector-file-base] In large file scenarios, split the single file into multiple shards [seatunnel]

via GitHub Mon, 13 Jan 2025 17:15:42 -0800


liunaijie commented on code in PR #8507:
URL: https://github.com/apache/seatunnel/pull/8507#discussion_r1913994777



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java:
##########
@@ -80,6 +86,212 @@
 public class OrcReadStrategy extends AbstractReadStrategy {
     private static final long MIN_SIZE = 16 * 1024;
 
+    private int batchReadRows = 1024;
+
+    /** user can specified row count per split */
+    private long rowCountPerSplitByUser = 0;
+
+    private final long DEFAULT_FILE_SIZE_PER_SPLIT = 1024 * 1024 * 30;
+    private final long DEFAULT_ROW_COUNT = 100000;
+    private long fileSizePerSplitByUser = DEFAULT_FILE_SIZE_PER_SPLIT;
+
+    @Override
+    public void setPluginConfig(Config pluginConfig) {
+        super.setPluginConfig(pluginConfig);
+        if (pluginConfig.hasPath(BaseSourceConfigOptions.ROW_COUNT_PER_SPLIT.key())) {
+            rowCountPerSplitByUser =
+                    pluginConfig.getLong(BaseSourceConfigOptions.ROW_COUNT_PER_SPLIT.key());
+        }
+        if (pluginConfig.hasPath(BaseSourceConfigOptions.FILE_SIZE_PER_SPLIT.key())) {
+            fileSizePerSplitByUser =
+                    pluginConfig.getLong(BaseSourceConfigOptions.FILE_SIZE_PER_SPLIT.key());
+        }
+    }
+
+    /**
+     * split a file into many splits: good: 1. lower memory occupy. split read end, the memory can
+     * recycle. 2. lower checkpoint ack delay 3. Support fine-grained concurrency bad: 1. cannot
+     * guarantee the order of the data.
+     *
+     * @param path file path
+     * @return FileSourceSplit set
+     */
+    @Override
+    public Set<FileSourceSplit> getFileSourceSplits(String path) {

Review Comment:
   Hi, thanks for this great work. As you note, enabling this can leave the 
**data out of order**, so could we enable this feature only when the user 
configures the file_size_per_split/row_count_per_split parameters? If neither 
is set, we would still use the original method. We can describe this feature 
in the documentation.
   
   Also, to use this feature the file format must support seeking quickly to 
a specified offset, as with the `rows.seekToRow` method. 
   We now support other file formats as well, such as `parquet` and `avro`. 
Could you help update those formats too? Thanks.
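
   The opt-in behaviour suggested above could look roughly like the sketch below. It is only an illustration, not code from the PR: the class `OptInSplitSketch`, its methods, and the use of a plain `Map` in place of the plugin `Config` are all hypothetical, and the option keys are assumed to be the names used in this comment. Each planned row range would then be read by seeking to its start row (the `rows.seekToRow` call mentioned above).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: split a file by row ranges only when the user opted in. */
final class OptInSplitSketch {
    // Assumed option keys, taken from the names used in the review comment.
    static final String ROW_COUNT_PER_SPLIT = "row_count_per_split";
    static final String FILE_SIZE_PER_SPLIT = "file_size_per_split";

    /** Splitting is enabled only if the user set at least one of the two options. */
    static boolean splitEnabled(Map<String, ?> userConfig) {
        return userConfig.containsKey(ROW_COUNT_PER_SPLIT)
                || userConfig.containsKey(FILE_SIZE_PER_SPLIT);
    }

    /**
     * Returns {startRow, endRowExclusive} pairs covering [0, totalRows).
     * When splitting is disabled (or not useful), the whole file is one range,
     * which corresponds to the original, order-preserving read path.
     */
    static List<long[]> planRowRanges(long totalRows, long rowsPerSplit, boolean enabled) {
        List<long[]> ranges = new ArrayList<>();
        if (!enabled || rowsPerSplit <= 0 || totalRows <= rowsPerSplit) {
            ranges.add(new long[] {0L, totalRows});
            return ranges;
        }
        for (long start = 0; start < totalRows; start += rowsPerSplit) {
            ranges.add(new long[] {start, Math.min(start + rowsPerSplit, totalRows)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, Object> userConfig =
                Collections.<String, Object>singletonMap(ROW_COUNT_PER_SPLIT, 100_000L);
        boolean enabled = splitEnabled(userConfig);
        for (long[] range : planRowRanges(250_000L, 100_000L, enabled)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

   With this shape, users who set neither option keep a single split per file and therefore the original ordering, while the out-of-order trade-off applies only to users who explicitly asked for splitting.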



