Hi Everyone,

I am looking into the RewriteFiles <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/RewriteFiles.java> API and its implementation BaseRewriteFiles <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java>.
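For context, this is roughly how the API ends up being driven (a minimal sketch, assuming the RewriteFiles operation comes from Table#newRewrite() and the file sets are already built; RewriteCommitSketch and commitRewrite are just illustrative names):

import java.util.Set;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Table;

public class RewriteCommitSketch {
  // Replaces the given data/delete files with the rewritten ones in a single commit.
  // Every DataFile/DeleteFile object in these sets is held on the heap until
  // commit() finishes, which is where the memory pressure described below comes from.
  static void commitRewrite(
      Table table,
      Set<DataFile> removedDataFiles,
      Set<DeleteFile> removedDeleteFiles,
      Set<DataFile> addedDataFiles,
      Set<DeleteFile> addedDeleteFiles) {
    table
        .newRewrite()
        .rewriteFiles(removedDataFiles, removedDeleteFiles, addedDataFiles, addedDeleteFiles)
        .commit();
  }
}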
Currently, this works as follows:

1. It accumulates all the files to be added and deleted.
2. At commit time, it creates a new snapshot after writing all the entries to the corresponding manifest files.

It has been observed that if the accumulated file objects are large, this takes a lot of memory. *e.g.*: each DataFile object is about *1KB* in size and the total number of accumulated files (additions plus deletions) is around *1 million*, so the total memory consumed by *RewriteFiles* will be around *1GB*.

Such a dataset can arise for the following reasons:

1. The table is very wide, with say 1000 columns.
2. Most of the columns are of String data type, which can take more space to store the lower and upper bounds.
3. The table has billions of records spread across millions of data files.
4. Data compaction procedures/jobs are being run for the first time.
5. Or, the table was un-partitioned and was later evolved with new partition columns.
6. And now it is trying to compact the table.

Attaching a heap dump from one such dataset while using the API:

> RewriteFiles rewriteFiles(
>     Set<DataFile> removedDataFiles,
>     Set<DeleteFile> removedDeleteFiles,
>     Set<DataFile> addedDataFiles,
>     Set<DeleteFile> addedDeleteFiles)

[image: Screenshot 2024-01-11 at 10.01.54 PM.png]

We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L45C11-L45C43>, which, together with the configuration PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>, helps create smaller file groups and spread the rewrite across multiple commits (a rough configuration sketch follows my signature). Currently, engines like Spark can follow this strategy. However, since Spark runs all the compaction jobs concurrently, there is a chance that many jobs land on the same machines and accumulate high memory usage.

My question is: can we make these implementations <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43> better to avoid heap pressure? Also, has anyone encountered similar issues, and if so, how did they fix them?

Regards,
Naveen Kumar
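P.S. For reference, a minimal sketch of how the partial-progress knobs can be set through the Spark rewrite action (option names come from the RewriteDataFiles interface; the class and method names here are only illustrative):

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class PartialProgressSketch {
  // With partial progress enabled, the rewrite is split into file groups and
  // committed in batches, so a single commit does not have to hold every
  // rewritten DataFile object in memory at once.
  static RewriteDataFiles.Result compactWithPartialProgress(SparkSession spark, Table table) {
    return SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")   // default: false
        .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10") // default: 10
        .execute();
  }
}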