zhangqs0205 opened a new issue, #9469: URL: https://github.com/apache/seatunnel/issues/9469
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

### Description:

I encountered a memory leak while using the HdfsFile Sink to write data in Parquet format.

### Dataset Details:

- Total records: 100 million
- Fields per row: 10

### Analysis:

After reviewing the code, I identified a potential memory leak in `AbstractWriteStrategy`. The default `batch_size` (1,000,000 records) controls the maximum number of records per file, while `ParquetWriter`'s default `rowGroupSize` (128MB) limits how much data is cached in memory before being flushed to disk.

### Root Cause:

With my low-column-count data:

- 1 million records ≈ 80MB, which is below the 128MB `rowGroupSize` threshold
- Thus, no single `ParquetWriter` ever triggers a disk flush on its own
- 100 million records require 100 `ParquetWriter` objects
- Total memory consumption ≈ 100 × 80MB ≈ 8GB → OutOfMemoryError

### Code Evidence:

`org.apache.seatunnel.connectors.seatunnel.file.sink.writer.AbstractWriteStrategy`:

```java
@Override
public void write(SeaTunnelRow seaTunnelRow) throws FileConnectorException {
    if (currentBatchSize >= batchSize) {
        newFilePath(); // Only regenerates the file path
        currentBatchSize = 0; // Resets the counter
    }
    currentBatchSize++;
}
```

Problem: when `batchSize` is reached, the code resets the file path but does not call `finishAndCloseFile()` to flush the cached data. Buffered data remains in memory until the forced flush in `prepareCommit()` (a sketch of a possible fix is included below, after the issue details):

```java
@Override
public Optional<FileCommitInfo> prepareCommit() {
    this.finishAndCloseFile(); // Delayed flush occurs HERE
    ...
}
```

### Consequence:

All 100 `ParquetWriter` objects retain ~80MB each in memory until the final `prepareCommit()` call, causing sustained ~8GB of heap pressure and an eventual OutOfMemoryError.

### SeaTunnel Version

2.3.9

### SeaTunnel Config

```conf
{
  "transform": {
    "Sql": {
      "query": "select field1 as distinct_id, `field2` as `user_tag_drbq2`,`field3` as `user_tag_drbq3`,`field4` as `user_tag_drbq4`,`field5` as `user_tag_drbq5`,`field6` as `user_tag_drbq6` from source_table_285",
      "plugin_input": "source_table_285",
      "plugin_output": "transform_table_285"
    }
  },
  "sink": {
    "HdfsFile": {
      "fs.defaultFS": "hdfs://nameservice01",
      "hdfs_site_path": "/sensorsdata/main/platform_guidance/instances/hdfs/connection_info/hdfs-site.xml",
      "path": "/sa/integrator/tmp/p1/285/tag/sink/20250617",
      "file_format_type": "parquet",
      "row_delimiter": null,
      "plugin_input": "transform_table_285",
      "tmp_path": "/sa/tmp/seatunnel"
    }
  },
  "source": {
    "Jdbc": {
      "url": "jdbc:hive2://10.1.136.181:8416",
      "driver": "org.apache.hive.jdbc.HiveDriver",
      "user": "horizon_sys",
      "password": "1234",
      "query": "SELECT `field4`,`field1`,`field5`,`field2`,`field6`,`field3` \n FROM \n`horizon_workspace_default_1`.`big_data1`",
      "plugin_output": "source_table_285",
      "use_ssl": null,
      "ssl_ca_cert_path": null,
      "ssl_client_cert_path": null,
      "ssl_client_key_path": null
    }
  },
  "env": {
    "job.flink.customized.conf": "taskmanager.memory.process.size=5120m",
    "job.mode": "BATCH",
    "parallelism": "1",
    "job.name": "Seatunnel_V2_JobId_91_InstanceId_285"
  }
}
```

### Running Command

```shell
sh bin/seatunnel.sh --config ../write.conf
```

### Error Exception

```log
OutOfMemoryException
```

### Zeta or Flink or Spark Version

_No response_

### Java or Scala Version

_No response_

### Screenshots

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!
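For reference, here is a minimal sketch of the fix I have in mind for the PR. It uses only the methods already shown in the snippets above (`finishAndCloseFile()`, `newFilePath()`), is untested against the 2.3.9 codebase, and assumes `finishAndCloseFile()` can safely be called at batch rollover rather than only from `prepareCommit()`:

```java
@Override
public void write(SeaTunnelRow seaTunnelRow) throws FileConnectorException {
    if (currentBatchSize >= batchSize) {
        // Assumption: closing the current writer here flushes its buffered
        // row group to disk and releases the ~80MB it holds on the heap,
        // instead of keeping every ParquetWriter alive until prepareCommit().
        this.finishAndCloseFile();
        newFilePath(); // then roll over to a new file path as before
        currentBatchSize = 0;
    }
    currentBatchSize++;
}
```

With this change, heap usage should stay bounded by a single in-flight writer (~80MB for this dataset) rather than growing with the number of files written.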
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
