Re: [PR] [Improve][Connector-V2][OssFile] OSSFile Source Support filtering files by last modified time. [seatunnel]

via GitHub Thu, 17 Jul 2025 19:27:55 -0700


Alberne commented on code in PR #9526:
URL: https://github.com/apache/seatunnel/pull/9526#discussion_r2214736981



##########
docs/en/connector-v2/source/OssFile.md:
##########
@@ -194,19 +194,22 @@ If you assign file type to `parquet` `orc`, schema option 
not required, connecto
 | time_format               | string  | no       | HH:mm:ss            | Time 
type format, used to tell connector how to convert string to time, supported as 
the following formats:`HH:mm:ss` `HH:mm:ss.SSS`                                 
                                                                                
                                                                               |
 | filename_extension        | string  | no       | -                   | 
Filter filename extension, which used for filtering files with specific 
extension. Example: `csv` `.txt` `json` `.xml`.                                 
                                                                                
                                                                                
            |
 | skip_header_row_number    | long    | no       | 0                   | Skip 
the first few lines, but only for the txt and csv. For example, set like 
following:`skip_header_row_number = 2`. Then SeaTunnel will skip the first 2 
lines from source files                                                         
                                                                                
         |
-| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
                       |
+| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
               |
 | schema                    | config  | no       | -                   | The 
schema of upstream data.                                                        
                                                                                
                                                                                
                                                                                
|
 | sheet_name                | string  | no       | -                   | 
Reader the sheet of the workbook,Only used when file_format is excel.           
                                                                                
                                                                                
                                                                                
    |
 | xml_row_tag               | string  | no       | -                   | 
Specifies the tag name of the data rows within the XML file, only used when 
file_format is xml.                                                             
                                                                                
                                                                                
        |
 | xml_use_attr_format       | boolean | no       | -                   | 
Specifies whether to process data using the tag attribute format, only used 
when file_format is xml.                                                        
                                                                                
                                                                                
        |
-| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
                       |
+| csv_use_header_line       | boolean | no       | false               | 
Whether to use the header line to parse the file, only used when the 
file_format is `csv` and the file contains the header line that match RFC 4180  
                                                                                
                                                                                
               |
 | compress_codec            | string  | no       | none                | Which 
compress codec the files used.                                                  
                                                                                
                                                                                
                                                                              |
 | encoding                  | string  | no       | UTF-8               |
 | null_format               | string  | no       | -                   | Only 
used when file_format_type is text. null_format to define which strings can be 
represented as null. e.g: `\N`                                                  
                                                                                
                                                                                
|
-| binary_chunk_size         | int     | no       | 1024                | Only 
used when file_format_type is binary. The chunk size (in bytes) for reading 
binary files. Default is 1024 bytes. Larger values may improve performance for 
large files but use more memory.                                                
                                                                                
   |
+| binary_chunk_size         | int     | no       | 1024                | Only 
used when file_format_type is binary. The chunk size (in bytes) for reading 
binary files. Default is 1024 bytes. Larger values may improve performance for 
large files but use more memory.                                                
                                                                                
    |
 | binary_complete_file_mode | boolean | no       | false               | Only 
used when file_format_type is binary. Whether to read the complete file as a 
single chunk instead of splitting into chunks. When enabled, the entire file 
content will be read into memory at once. Default is false.                     
                                                                                
     |
 | file_filter_pattern       | string  | no       |                     | 
Filter pattern, which used for filtering files.                                 
                                                                                
                                                                                
                                                                                
    |
 | common-options            | config  | no       | -                   | 
Source plugin common parameters, please refer to [Source Common 
Options](../source-common-options.md) for details.                              
                                                                                
                                                                                
                    |
+| file_filter_modified_start  | string  | no       | -                   | 
File modification time filter. The connector will filter some files base on the 
last modification start time (include start time). the default data format is 
yyyy-mm-dd, if you not set `file_filter_modified_date_format`.                  
                                                                                
      |
+| file_filter_modified_end    | string  | no       | -                   | 
File modification time filter. The connector will filter some files base on the 
last modification end time (not include end time). the default data format is 
yyyy-mm-dd, if you not set `file_filter_modified_date_format`.                  
                                                                                
      |
+| file_filter_modified_date_format  | string  | no       | -                   
| File modification time format. This parameter specifies the file's last 
modification time for filtering, using which time format. If not set, it 
defaults to the yyyy-MM-dd format, with the time zone defaulting to GMT+8.      
                                                                                
                   |

Review Comment:
   @Hisoka-X
   You are correct. However, consider incremental synchronization of files 
by hour: that case needs the `yyyy-MM-dd HH` format, which a date-only 
format cannot express.
   
   Additionally, in our case, the third-party Appsflyer system we are 
integrating with generates incremental attribution data for the past X days 
every 2 hours. The file names follow no pattern we can easily match on, so 
we can only rely on the file modification time to pick out the new files.
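
   To make the hourly case concrete, here is a minimal sketch of the filtering semantics described in the docs above: start time inclusive, end time exclusive, with a configurable date format. The class and method names (`ModifiedTimeFilter`, `parseToEpochMillis`, `accept`) are hypothetical and are not taken from the PR. The zone is passed in explicitly, and defaulting a missing hour to 0 is an assumption about how a date-only format should behave.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper, not the connector's actual implementation.
final class ModifiedTimeFilter {

    // Parse a boundary such as "2025-07-17 10" using the configured pattern.
    // If the pattern has no hour field (e.g. "yyyy-MM-dd"), the hour defaults to 0.
    static long parseToEpochMillis(String text, String pattern, ZoneId zone) {
        DateTimeFormatter formatter = new DateTimeFormatterBuilder()
                .appendPattern(pattern)
                .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
                .toFormatter();
        return LocalDateTime.parse(text, formatter)
                .atZone(zone)
                .toInstant()
                .toEpochMilli();
    }

    // Keep a file when start <= lastModified < end (start inclusive, end exclusive).
    static boolean accept(long lastModifiedMillis, long startMillis, long endMillis) {
        return lastModifiedMillis >= startMillis && lastModifiedMillis < endMillis;
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("GMT+8");
        long start = parseToEpochMillis("2025-07-17 10", "yyyy-MM-dd HH", zone);
        long end = parseToEpochMillis("2025-07-17 12", "yyyy-MM-dd HH", zone);

        // A file modified one hour into the window is kept; one modified at `end` is not.
        long insideWindow = start + 60L * 60 * 1000;
        System.out.println(accept(insideWindow, start, end));
        System.out.println(accept(end, start, end));
    }
}
```

   With a date-only format, the smallest window that can be expressed is a whole day, so a job that runs every 2 hours would pick up the same files again on each run.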
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
