Re: [PR] [Improve][Connector-file-base] In large file scenarios, split the single file into multiple shards [seatunnel]

via GitHub Thu, 13 Feb 2025 00:36:51 -0800


Hisoka-X commented on code in PR #8507:
URL: https://github.com/apache/seatunnel/pull/8507#discussion_r1954047394



##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -102,6 +107,22 @@ The compress codec of files and the details that supported 
as the following show
 - orc/parquet:  
   automatically recognizes the compression type, no additional settings 
required.
 
+### split_single_file_to_multiple_splits
+
+whether to split a file into many splits. true will split.
+
+### file_size_per_split
+
+split a file into many splits according to file size, if row_count_per_split 
not config. use row_count_per_split prefer. only valid for orc/parquet now.
+
+### row_count_per_split
+
+split a file into many splits according to row count. only valid for 
orc/parquet now.

Review Comment:
   ```suggestion
   Split a file into many splits according to row count. Only valid for orc/parquet now.
   ```



##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -102,6 +107,22 @@ The compress codec of files and the details that supported 
as the following show
 - orc/parquet:  
   automatically recognizes the compression type, no additional settings 
required.
 
+### split_single_file_to_multiple_splits
+
+whether to split a file into many splits. true will split.
+
+### file_size_per_split
+
+split a file into many splits according to file size, if row_count_per_split 
not config. use row_count_per_split prefer. only valid for orc/parquet now.
+
+### row_count_per_split
+
+split a file into many splits according to row count. only valid for 
orc/parquet now.
+
+### batch_read_rows
+
+max size in a batch. now only useful for orc file. default is 1024, if memory 
is enough, you can increase it to speed up reading.

Review Comment:
   ```suggestion
   The maximum number of rows in a batch. Currently only useful for orc files. The default value is 1024; if memory is sufficient, you can increase it to speed up reading. Only takes effect when split_single_file_to_multiple_splits is enabled.
   ```



##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -33,21 +33,26 @@ Read all the data in a split in a pollNext call. What 
splits are read will be sa
 
 ## Options
 
-|         name          |  type  | required | default value  |
-|-----------------------|--------|----------|----------------|
-| table_name            | string | yes      | -              |
-| metastore_uri         | string | yes      | -              |
-| krb5_path             | string | no       | /etc/krb5.conf |
-| kerberos_principal    | string | no       | -              |
-| kerberos_keytab_path  | string | no       | -              |
-| hdfs_site_path        | string | no       | -              |
-| hive_site_path        | string | no       | -              |
-| hive.hadoop.conf      | Map    | no       | -              |
-| hive.hadoop.conf-path | string | no       | -              |
-| read_partitions       | list   | no       | -              |
-| read_columns          | list   | no       | -              |
-| compress_codec        | string | no       | none           |
-| common-options        |        | no       | -              |
+|         name                         |  type   | required | default value  |
+|--------------------------------------|---------|----------|----------------|
+| table_name                           | string  | yes      | -              |
+| metastore_uri                        | string  | yes      | -              |
+| krb5_path                            | string  | no       | /etc/krb5.conf |
+| kerberos_principal                   | string  | no       | -              |
+| kerberos_keytab_path                 | string  | no       | -              |
+| hdfs_site_path                       | string  | no       | -              |
+| hive_site_path                       | string  | no       | -              |
+| hive.hadoop.conf                     | Map     | no       | -              |
+| hive.hadoop.conf-path                | string  | no       | -              |
+| read_partitions                      | list    | no       | -              |
+| read_columns                         | list    | no       | -              |
+| compress_codec                       | string  | no       | -              |
+| compress_codec                       | string  | no       | -              |
+| split_single_file_to_multiple_splits | long    | no       | false          | 

Review Comment:
   ```suggestion
   | split_single_file_to_multiple_splits | boolean | no       | false          |
   ```



##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -102,6 +107,22 @@ The compress codec of files and the details that supported 
as the following show
 - orc/parquet:  
   automatically recognizes the compression type, no additional settings 
required.
 
+### split_single_file_to_multiple_splits
+
+whether to split a file into many splits. true will split.

Review Comment:
   ```suggestion
   Whether to split a file into many splits. If true, the file is split. Only valid for orc/parquet now.
   ```



##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -102,6 +107,22 @@ The compress codec of files and the details that supported 
as the following show
 - orc/parquet:  
   automatically recognizes the compression type, no additional settings 
required.
 
+### split_single_file_to_multiple_splits
+
+whether to split a file into many splits. true will split.
+
+### file_size_per_split
+
+split a file into many splits according to file size, if row_count_per_split 
not config. use row_count_per_split prefer. only valid for orc/parquet now.

Review Comment:
   ```suggestion
   Split a file into many splits according to file size. Only used if row_count_per_split is not configured; row_count_per_split takes precedence when both are set. Only valid for orc/parquet now.
   ```
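
The precedence between the two options can be sketched roughly as follows. This is only an illustration of the documented rule (row_count_per_split wins, file_size_per_split is the fallback), not the connector's actual implementation; the function `plan_splits` and its parameters `total_rows` and `file_size` are hypothetical names, and a real orc/parquet reader would likely align splits to stripes or row groups rather than arbitrary rows.

```python
from typing import List, Optional, Tuple


def plan_splits(
    total_rows: int,
    file_size: int,
    row_count_per_split: Optional[int] = None,
    file_size_per_split: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return half-open (start_row, end_row) ranges for one file.

    Hypothetical helper: it only illustrates which option takes precedence.
    """
    if total_rows <= 0:
        return []

    if row_count_per_split is not None and row_count_per_split > 0:
        # row_count_per_split is preferred whenever it is configured.
        rows_per_split = row_count_per_split
    elif file_size_per_split is not None and file_size_per_split > 0:
        # Fallback: derive the number of splits from the file size
        # (ceiling division), then spread the rows evenly across them.
        num_splits = max(1, -(-file_size // file_size_per_split))
        rows_per_split = -(-total_rows // num_splits)
    else:
        # Neither option configured: keep the file as a single split.
        rows_per_split = total_rows

    return [
        (start, min(start + rows_per_split, total_rows))
        for start in range(0, total_rows, rows_per_split)
    ]
```

With both options set, the row-count value decides the layout and the file-size value is ignored.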



