Hi Everyone,

I am looking into the RewriteFiles <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/RewriteFiles.java> API and its implementation BaseRewriteFiles <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java>.
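For context, this is roughly how the API ends up being driven (a minimal sketch, assuming the RewriteFiles operation comes from Table#newRewrite() and the file sets are already built; RewriteCommitSketch and commitRewrite are just illustrative names):

import java.util.Set;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Table;

public class RewriteCommitSketch {
  // Replaces the given data/delete files with the rewritten ones in a single commit.
  // Every DataFile/DeleteFile object in these sets is held on the heap until
  // commit() finishes, which is where the memory pressure described below comes from.
  static void commitRewrite(
      Table table,
      Set<DataFile> removedDataFiles,
      Set<DeleteFile> removedDeleteFiles,
      Set<DataFile> addedDataFiles,
      Set<DeleteFile> addedDeleteFiles) {
    table
        .newRewrite()
        .rewriteFiles(removedDataFiles, removedDeleteFiles, addedDataFiles, addedDeleteFiles)
        .commit();
  }
}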
Currently, this works as follows:

1. It accumulates all the files to be added and deleted.
2. At commit time, it creates a new snapshot after writing all the entries to the corresponding manifest files.

It has been observed that if the accumulated file objects are large, this takes a lot of memory. *e.g.*: each DataFile object is about *1KB* in size and the total number of accumulated files (additions plus deletions) is around *1 million*, so the total memory consumed by *RewriteFiles* will be around *1GB*.

Such a dataset can arise for the following reasons:

1. The table is very wide, with say 1000 columns.
2. Most of the columns are of String data type, which can take more space to store the lower and upper bounds.
3. The table has billions of records spread across millions of data files.
4. Data compaction procedures/jobs are being run for the first time.
5. Or, the table was un-partitioned and was later evolved with new partition columns.
6. And now it is trying to compact the table.

Attaching a heap dump from one such dataset while using the API:

> RewriteFiles rewriteFiles(
>     Set<DataFile> removedDataFiles,
>     Set<DeleteFile> removedDeleteFiles,
>     Set<DataFile> addedDataFiles,
>     Set<DeleteFile> addedDeleteFiles)

[image: Screenshot 2024-01-11 at 10.01.54 PM.png]

We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L45C11-L45C43>, which, together with the configuration PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>, helps create smaller file groups and spread the rewrite across multiple commits (a rough configuration sketch follows my signature). Currently, engines like Spark can follow this strategy. However, since Spark runs all the compaction jobs concurrently, there is a chance that many jobs land on the same machines and accumulate high memory usage.

My question is: can we make these implementations <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43> better to avoid heap pressure? Also, has anyone encountered similar issues, and if so, how did they fix them?

Regards,
Naveen Kumar
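P.S. For reference, a minimal sketch of how the partial-progress knobs can be set through the Spark rewrite action (option names come from the RewriteDataFiles interface; the class and method names here are only illustrative):

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class PartialProgressSketch {
  // With partial progress enabled, the rewrite is split into file groups and
  // committed in batches, so a single commit does not have to hold every
  // rewritten DataFile object in memory at once.
  static RewriteDataFiles.Result compactWithPartialProgress(SparkSession spark, Table table) {
    return SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")   // default: false
        .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10") // default: 10
        .execute();
  }
}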