zhenyue-xu opened a new pull request, #9668:
URL: https://github.com/apache/seatunnel/pull/9668

   ### Purpose of this pull request
   
   This pull request enhances the Excel file sink connector by adding support 
for automatic sheet splitting when data exceeds a configurable maximum number 
of rows per sheet. The main improvements include:
   
   1. Added a new configuration option `sheet_max_rows` to control the maximum 
number of rows per Excel sheet (default: 1,048,576)
   2. Implemented automatic sheet creation when the row limit is reached
   3. Fixed the sheet naming strategy to use a more predictable pattern 
(Sheet0, Sheet1, etc.)
   4. Improved the row tracking mechanism to properly handle multiple sheets
   
   This change fixes a critical issue where large datasets would cause runtime 
exceptions when exceeding Excel's row limits.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR introduces a new user-facing configuration option:
   
   - **New Configuration**: `sheet_max_rows` - An integer parameter that allows 
users to specify the maximum number of data rows per Excel sheet (excluding 
header row). The default value is 1,048,576 which matches Excel's maximum row 
limit.
   
   **Previous behavior**: Excel files would put all data into a single sheet 
regardless of size, which would cause a runtime exception when exceeding 
Excel's row limitations:
   ```
   java.lang.IllegalArgumentException: Invalid row number (1048576) outside 
allowable range (0..1048575)
        at 
org.apache.poi.xssf.streaming.SXSSFSheet.createRow(SXSSFSheet.java:123)
        at 
org.apache.poi.xssf.streaming.SXSSFSheet.createRow(SXSSFSheet.java:65)
   ```
   
   **New behavior**: When the number of data rows exceeds the configured 
`sheet_max_rows` value, the connector automatically creates new sheets (Sheet1, 
Sheet2, etc.) and continues writing data, with each sheet having its own header 
row. This prevents the runtime exception and allows processing of datasets of 
any size.
   
   This is a backward-compatible change as the default value maintains the 
existing behavior for datasets within Excel's limits, while fixing the error 
for larger datasets.
   
   ### How was this patch tested?
   
   This patch was tested with comprehensive unit tests added in 
`ExcelGeneratorTest.java`:
   
   1. **Basic functionality test** (`testGenerateBasicExcelFile`):
      - Tests writing a small dataset (5 rows) to verify basic Excel generation 
functionality
      - Validates file creation, content structure, and header row generation
      - Tests handling of null values in data
   
   2. **Large dataset test** (`testGenerateLargeDataFile`):
      - Tests writing 1.2 million rows to verify automatic sheet splitting 
functionality  
      - Validates that data exceeding the sheet limit creates multiple sheets 
properly
      - Tests performance with large datasets
      - Ensures the runtime exception is prevented
   
   3. **Validation helper methods**:
      - `validateGeneratedFile()` method that opens the generated Excel file 
and verifies:
        - Correct number of sheets created
        - Proper row counts per sheet
        - Header row structure and content
        - Data integrity
   
   The tests use Apache POI to read and validate the generated Excel files, 
ensuring the output is correctly formatted and accessible by standard Excel 
readers.
   
   ### Check list
   
   * [ ] If any new Jar binary package adding in your PR, please add License 
Notice according
     [New License 
Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [ ] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If you are contributing the connector code, please check that the 
following files are updated:
     1. Update 
[plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
 and add new connector information in it
     2. Update the pom file of 
[seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
     3. Add ci label in 
[label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
     4. Add e2e testcase in 
[seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
     5. Update connector 
[plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to