DanielCarter-stack commented on issue #10623:
URL: https://github.com/apache/seatunnel/issues/10623#issuecomment-4089050022
This is a valid memory management issue when writing large datasets to
Iceberg with upsert mode enabled.
**Root cause**: `IcebergSinkWriter.results` accumulates `WriteResult`
objects between checkpoints. These hold `DataFile[]` and `DeleteFile[]`
references that are only cleared during `snapshotState()`, not in
`prepareCommit()`.
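The accumulation pattern can be sketched as below. This is an illustrative simplification, not the actual SeaTunnel source: the class, method, and field names mirror the ones mentioned above, but `WriteResult` is a stand-in for the Iceberg type that pins `DataFile[]`/`DeleteFile[]` metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: simplified stand-ins for SeaTunnel's IcebergSinkWriter
// and Iceberg's WriteResult; not the actual source.
class IcebergSinkWriterSketch {
    // Stand-in for WriteResult, which holds DataFile[]/DeleteFile[] references.
    static final class WriteResult {}

    private final List<WriteResult> results = new ArrayList<>();

    void completeFile() {
        // Each rolled file adds a WriteResult that stays referenced.
        results.add(new WriteResult());
    }

    List<WriteResult> prepareCommit() {
        // Bug pattern: results are handed off but NOT cleared here,
        // so the list keeps growing between checkpoints.
        return new ArrayList<>(results);
    }

    List<WriteResult> snapshotState(long checkpointId) {
        // Only snapshotState() releases the accumulated references; without
        // periodic checkpoints (e.g. BATCH mode) this may run once, at job end.
        List<WriteResult> snapshot = new ArrayList<>(results);
        results.clear();
        return snapshot;
    }

    int pendingResults() {
        return results.size();
    }
}
```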
**Risk factors**:
- **BATCH mode without `checkpoint.interval`** - `snapshotState()` may only
be called once at job end, causing unbounded accumulation
- **Large `write.target-file-size-bytes`** - buffers more data per file
before rolling
- **High-cardinality partitions** - maintains a separate writer per partition
- **Upsert mode** - generates both data and delete files, doubling memory
overhead
**Immediate mitigations**:
1. Add `checkpoint.interval` (critical for BATCH mode): `env {
checkpoint.interval = 300000 }`
2. Reduce target file size: `write.target-file-size-bytes = 134217728`
(128MB)
3. Disable upsert if not required: `iceberg.table.upsert-mode-enabled =
false`
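Pulling the three mitigations together, a job config might look like the sketch below. Option placement is an assumption on my part (in particular, passing `write.target-file-size-bytes` through `iceberg.table.write-props`), so verify it against the SeaTunnel Iceberg sink docs for your version:

```hocon
env {
  # Force periodic checkpoints so snapshotState() releases buffered WriteResults
  checkpoint.interval = 300000   # 5 minutes, in milliseconds
}

sink {
  Iceberg {
    # ... catalog/table options ...

    # Assumed placement: table write properties passed through to Iceberg
    iceberg.table.write-props = {
      # Roll files at 128 MB instead of Iceberg's 512 MB default
      # to cap per-writer buffering
      write.target-file-size-bytes = 134217728
    }

    # Disable upsert if dedup/overwrite semantics are not required,
    # avoiding the extra delete-file overhead
    iceberg.table.upsert-mode-enabled = false
  }
}
```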
**Questions to help design a proper fix**:
1. Are you running in BATCH or STREAMING mode?
2. What is your `checkpoint.interval` setting?
3. What is the partition key and approximate number of distinct partitions?
4. Can you share a heap dump or thread dump from the OOM?
Potential longer-term fixes: clearing `results` in `prepareCommit()` where commit semantics allow it, adding memory-based flush thresholds, and documenting the checkpoint requirement for BATCH mode.
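A memory-based flush threshold could take roughly the following shape. This is a hypothetical sketch, not a proposed patch; the class and method names are mine, and the byte estimate would in practice come from the file sizes recorded in each `WriteResult`:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a size-bounded buffer for pending write results.
// The generic element type stands in for Iceberg's WriteResult.
class BoundedResultsBuffer<T> {
    private final List<T> results = new ArrayList<>();
    private long bufferedBytes = 0;
    private final long flushThresholdBytes;

    BoundedResultsBuffer(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    /** Records a result; returns true when the caller should flush/commit. */
    boolean add(T result, long estimatedBytes) {
        results.add(result);
        bufferedBytes += estimatedBytes;
        return bufferedBytes >= flushThresholdBytes;
    }

    /** Hands off and releases everything buffered so far. */
    List<T> drain() {
        List<T> out = new ArrayList<>(results);
        results.clear();
        bufferedBytes = 0;
        return out;
    }
}
```

The writer would call `drain()` and trigger an early commit whenever `add(...)` returns true, instead of waiting for the next checkpoint.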