Alberne commented on code in PR #9526:
URL: https://github.com/apache/seatunnel/pull/9526#discussion_r2214736981
##########
docs/en/connector-v2/source/OssFile.md:
##########
@@ -194,19 +194,22 @@ If you assign file type to `parquet` `orc`, schema option
not required, connecto
| time_format | string | no | HH:mm:ss | Time
type format, used to tell connector how to convert string to time, supported as
the following formats:`HH:mm:ss` `HH:mm:ss.SSS`
|
| filename_extension | string | no | - |
Filter filename extension, which used for filtering files with specific
extension. Example: `csv` `.txt` `json` `.xml`.
|
| skip_header_row_number | long | no | 0 | Skip
the first few lines, but only for the txt and csv. For example, set like
following:`skip_header_row_number = 2`. Then SeaTunnel will skip the first 2
lines from source files
|
-| csv_use_header_line | boolean | no | false |
Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
+| csv_use_header_line | boolean | no | false |
Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
| schema | config | no | - | The
schema of upstream data.
|
| sheet_name | string | no | - |
Reader the sheet of the workbook,Only used when file_format is excel.
|
| xml_row_tag | string | no | - |
Specifies the tag name of the data rows within the XML file, only used when
file_format is xml.
|
| xml_use_attr_format | boolean | no | - |
Specifies whether to process data using the tag attribute format, only used
when file_format is xml.
|
-| csv_use_header_line | boolean | no | false |
Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
+| csv_use_header_line | boolean | no | false |
Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
| compress_codec | string | no | none | Which
compress codec the files used.
|
| encoding | string | no | UTF-8 |
| null_format | string | no | - | Only
used when file_format_type is text. null_format to define which strings can be
represented as null. e.g: `\N`
|
-| binary_chunk_size | int | no | 1024 | Only
used when file_format_type is binary. The chunk size (in bytes) for reading
binary files. Default is 1024 bytes. Larger values may improve performance for
large files but use more memory.
|
+| binary_chunk_size | int | no | 1024 | Only
used when file_format_type is binary. The chunk size (in bytes) for reading
binary files. Default is 1024 bytes. Larger values may improve performance for
large files but use more memory.
|
| binary_complete_file_mode | boolean | no | false | Only
used when file_format_type is binary. Whether to read the complete file as a
single chunk instead of splitting into chunks. When enabled, the entire file
content will be read into memory at once. Default is false.
|
| file_filter_pattern | string | no | |
Filter pattern, which used for filtering files.
|
| common-options | config | no | - |
Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
+| file_filter_modified_start | string | no | - |
File modification time filter. The connector will filter some files base on the
last modification start time (include start time). the default data format is
yyyy-mm-dd, if you not set `file_filter_modified_date_format`.
|
+| file_filter_modified_end | string | no | - |
File modification time filter. The connector will filter some files base on the
last modification end time (not include end time). the default data format is
yyyy-mm-dd, if you not set `file_filter_modified_date_format`.
|
+| file_filter_modified_date_format | string | no | -
| File modification time format. This parameter specifies the file's last
modification time for filtering, using which time format. If not set, it
defaults to the yyyy-MM-dd format, with the time zone defaulting to GMT+8.
|
Review Comment:
@Hisoka-X
You are correct. However, consider incremental synchronization of files
by hour: that case needs the `yyyy-MM-dd HH` format. In our case, the
third-party AppsFlyer system we integrate with regenerates incremental
attribution data for the past X days every 2 hours. The file names do not
reliably identify which files are new, so we can only rely on the file
modification time to select them.
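
To make the hourly case concrete, here is a minimal sketch of the selection
I have in mind. It is not the connector's implementation: the helper names
(`hourly_window`, `select_files`) and the sample file names are hypothetical,
and `%Y-%m-%d %H` is only the Python analogue of the `yyyy-MM-dd HH` pattern.
It assumes the same semantics as the option descriptions in this diff: start
time inclusive, end time exclusive. Time zones are ignored here for brevity.

```python
from datetime import datetime, timedelta

# Python analogue of the Java-style pattern "yyyy-MM-dd HH" (assumption for this sketch).
HOUR_FORMAT = "%Y-%m-%d %H"


def hourly_window(start_text, hours=1, fmt=HOUR_FORMAT):
    """Parse an hour-granular start time and return a (start, end) pair."""
    start = datetime.strptime(start_text, fmt)
    return start, start + timedelta(hours=hours)


def select_files(files, start_text, hours=1):
    """Keep files whose modification time falls in [start, end).

    `files` is a list of (name, modified) tuples with naive datetimes.
    """
    start, end = hourly_window(start_text, hours)
    return [name for name, modified in files if start <= modified < end]


# Hypothetical listing: the names carry no usable date, only mtime does.
files = [
    ("export_a", datetime(2025, 7, 17, 9, 59, 59)),
    ("export_b", datetime(2025, 7, 17, 10, 0, 0)),
    ("export_c", datetime(2025, 7, 17, 11, 30, 0)),
    ("export_d", datetime(2025, 7, 17, 12, 0, 0)),
]

# A 2-hour window starting at 10:00 matches the 2-hour generation cycle.
selected = select_files(files, "2025-07-17 10", hours=2)
print(selected)
```

With a day-only format, all four files would fall into the same window, which
is why the hour needs to be part of the date format for this use case.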