Re: [PR] [Improve][Connector-V2][OssFile] OSSFile Source Support filtering files by last modified time. [seatunnel]

via GitHub Thu, 17 Jul 2025 06:17:09 -0700


Hisoka-X commented on code in PR #9526:
URL: https://github.com/apache/seatunnel/pull/9526#discussion_r2213324925



##########
docs/en/connector-v2/source/OssFile.md:
##########
@@ -194,19 +194,22 @@ If you assign file type to `parquet` `orc`, schema option 
not required, connecto
 | time_format               | string  | no       | HH:mm:ss            | Time 
type format, used to tell connector how to convert string to time, supported as 
the following formats:`HH:mm:ss` `HH:mm:ss.SSS`                                 
                                                                                
                                                                               |
 | filename_extension        | string  | no       | -                   | 
Filter filename extension, which used for filtering files with specific 
extension. Example: `csv` `.txt` `json` `.xml`.                                 
                                                                                
                                                                                
            |
 | skip_header_row_number    | long    | no       | 0                   | Skip 
the first few lines, but only for the txt and csv. For example, set like 
following:`skip_header_row_number = 2`. Then SeaTunnel will skip the first 2 
lines from source files                                                         
                                                                                
         |
-| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
                       |
+| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
               |
 | schema                    | config  | no       | -                   | The 
schema of upstream data.                                                        
                                                                                
                                                                                
                                                                                
|
 | sheet_name                | string  | no       | -                   | 
Reader the sheet of the workbook,Only used when file_format is excel.           
                                                                                
                                                                                
                                                                                
    |
 | xml_row_tag               | string  | no       | -                   | 
Specifies the tag name of the data rows within the XML file, only used when 
file_format is xml.                                                             
                                                                                
                                                                                
        |
 | xml_use_attr_format       | boolean | no       | -                   | 
Specifies whether to process data using the tag attribute format, only used 
when file_format is xml.                                                        
                                                                                
                                                                                
        |
-| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
                       |
+| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
               |
 | compress_codec            | string  | no       | none                | Which 
compress codec the files used.                                                  
                                                                                
                                                                                
                                                                              |
 | encoding                  | string  | no       | UTF-8               |
 | null_format               | string  | no       | -                   | Only 
used when file_format_type is text. null_format to define which strings can be 
represented as null. e.g: `\N`                                                  
                                                                                
                                                                                
|
-| binary_chunk_size         | int     | no       | 1024                | Only 
used when file_format_type is binary. The chunk size (in bytes) for reading 
binary files. Default is 1024 bytes. Larger values may improve performance for 
large files but use more memory.                                                
                                                                                
   |
+| binary_chunk_size         | int     | no       | 1024                | Only 
used when file_format_type is binary. The chunk size (in bytes) for reading 
binary files. Default is 1024 bytes. Larger values may improve performance for 
large files but use more memory.                                                
                                                                                
    |
 | binary_complete_file_mode | boolean | no       | false               | Only 
used when file_format_type is binary. Whether to read the complete file as a 
single chunk instead of splitting into chunks. When enabled, the entire file 
content will be read into memory at once. Default is false.                     
                                                                                
     |
 | file_filter_pattern       | string  | no       |                     | 
Filter pattern, which used for filtering files.                                 
                                                                                
                                                                                
                                                                                
    |
 | common-options            | config  | no       | -                   | 
Source plugin common parameters, please refer to [Source Common 
Options](../source-common-options.md) for details.                              
                                                                                
                                                                                
                    |
+| file_filter_modified_start  | string  | no       | -                   | File modification time filter. The connector will filter some files base on the last modification start time (include start time). the default data format is yyyy-mm-dd, if you not set `file_filter_modified_date_format`. |
+| file_filter_modified_end    | string  | no       | -                   | File modification time filter. The connector will filter some files base on the last modification end time (not include end time). the default data format is yyyy-mm-dd, if you not set `file_filter_modified_date_format`. |
+| file_filter_modified_date_format  | string  | no       | -                   | File modification time format. This parameter specifies the file's last modification time for filtering, using which time format. If not set, it defaults to the yyyy-MM-dd format, with the time zone defaulting to GMT+8. |

Review Comment:
   Filtering by day already implies a time of day. For example, "files modified 
after 2025-06-07" really means files modified after 2025-06-07 00:00:00; the two 
filters have the same effect. So I think the additional configuration is 
unnecessary.
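
The equivalence can be illustrated with a minimal Python sketch. This is not the connector's Java implementation: `parse_bound` and `in_window` are hypothetical names, and the fixed GMT+8 zone is an assumption taken from the default stated in the proposed docs. A bare date is parsed as midnight of that day, and the window is start-inclusive and end-exclusive, as the new table rows describe.

```python
from datetime import datetime, timedelta, timezone

# Assumption: GMT+8, the default time zone named in the proposed docs.
TZ = timezone(timedelta(hours=8))


def parse_bound(text):
    """Parse a filter bound; a day-only value means 00:00:00 of that day."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=TZ)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")


def in_window(modified, start=None, end=None):
    """Keep a file when start <= modified < end; a missing bound is open."""
    if start is not None and modified < parse_bound(start):
        return False
    if end is not None and modified >= parse_bound(end):
        return False
    return True
```

Under these assumptions, `parse_bound("2025-06-07")` and `parse_bound("2025-06-07 00:00:00")` return the same instant, so a day-only bound and its explicit midnight form select the same files.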



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

