Hi Szehon,

Thanks for your email.
I agree that configuring metadata metrics per column will create smaller manifest files, since the lower and upper bounds are stored per content entry. Assuming your patch <https://github.com/apache/iceberg/pull/2608> is merged, it would work as follows:

1. Identify all the columns on which pruning is not needed.
2. Update the table properties to disable metrics on those columns (see the sketch after this list).
3. Run repair manifests <https://github.com/apache/iceberg/pull/2608>.
4. Run compaction.
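For step 2, here is a minimal sketch of what I mean, assuming the standard write.metadata.metrics.* table properties; the catalog, table identifier, and column names below are just placeholders:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Sketch for step 2: disable column-level metrics for columns that are never
    // used for pruning, so new manifest entries stop carrying their bounds.
    // The table identifier and column names are placeholders.
    public class DisableColumnMetrics {
      public static void run(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "wide_table"));

        table.updateProperties()
            // columns identified in step 1 as not useful for pruning
            .set("write.metadata.metrics.column.big_string_col", "none")
            .set("write.metadata.metrics.column.payload_json", "none")
            // keep bounds only where they actually help pruning
            .set("write.metadata.metrics.column.event_time", "truncate(16)")
            .commit();
      }
    }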
However, this will still not solve the original problem where *Set<DataFile>* and *Set<DeleteFile>* have grown to a very large size (say 1 million entries). This might be very rare, but I have seen cases where a user never ran compaction, added a new partition column, and is now trying to compact the entire table. Is this a valid use case? WDYT?

Regards,
Naveen Kumar

On Tue, May 21, 2024 at 10:47 PM Szehon Ho <[email protected]> wrote:

> Hi Naveen
>
> Yes, it sounds like it will help to disable metrics for those columns?
> IIRC, by default manifest entries have metrics at the 'truncate(16)' level
> for 100 columns, which as you see can be quite memory intensive. A
> potential improvement later is also to have the ability to remove counts by
> config, though I need to confirm whether that is feasible.
>
> Unfortunately, today the new metrics config will only apply to new data
> files (you have to rewrite them all, or otherwise phase old data files
> out). I had a patch a while back to add support for rewriting just the
> manifests with the new metrics config, but it has not been merged yet; if
> any reviewer has time to review, I can work on it again.
> https://github.com/apache/iceberg/pull/2608
>
> Thanks
> Szehon
>
> On Tue, May 21, 2024 at 1:43 AM Naveen Kumar <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I am looking into the RewriteFiles
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/RewriteFiles.java>
>> API and its implementation BaseRewriteFiles
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java>.
>> Currently this works as follows:
>>
>> 1. It accumulates all the files for addition and deletion.
>> 2. At commit time, it creates a new snapshot after adding all the
>> entries to the corresponding manifest files.
>>
>> It has been observed that if the accumulated file objects are huge in
>> number, they take a lot of memory.
>> *eg*: Each DataFile object is about *1KB* in size, and the total number of
>> accumulated files (additions or deletions) is *1 million*.
>> The total memory consumed by *RewriteFiles* will then be around *1GB*.
>>
>> Such a dataset can arise for the following reasons:
>>
>> 1. The table is very wide, with say 1000 columns.
>> 2. Most of the columns are of String data type, which can take more
>> space to store the lower and upper bounds.
>> 3. The table has billions of records spread over millions of data files.
>> 4. A data compaction procedure/job is being run for the first time.
>> 5. Or, the table was un-partitioned and later evolved with new partition
>> columns.
>> 6. Now it is trying to compact the entire table.
>>
>> Attaching a heap dump from one such dataset while using the API
>>
>>> RewriteFiles rewriteFiles(
>>>     Set<DataFile> removedDataFiles,
>>>     Set<DeleteFile> removedDeleteFiles,
>>>     Set<DataFile> addedDataFiles,
>>>     Set<DeleteFile> addedDeleteFiles)
>>
>> [image: Screenshot 2024-01-11 at 10.01.54 PM.png]
>>
>> We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L45C11-L45C43>,
>> which help create smaller groups and multiple commits, together with the
>> configuration PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>
>> (see the sketch after this quoted message). Currently, engines like Spark
>> can follow this strategy. Since Spark runs all the compaction jobs
>> concurrently, there is a chance that many jobs land on the same machines
>> and accumulate high memory usage.
>>
>> My question is: can we make these implementations
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>
>> better to avoid heap pressure? Also, has someone encountered similar
>> issues, and if so, how did they fix it?
>>
>> Regards,
>> Naveen Kumar
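P.S. For reference, a minimal sketch of driving compaction through Spark's RewriteDataFiles action with the partial-progress options mentioned in the quoted message; the option values, class name, and table handle are placeholders:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    // Sketch: enable partial progress so the rewrite is committed in bounded
    // groups instead of accumulating one huge Set<DataFile> for a single commit.
    public class PartialProgressCompaction {
      public static void compact(SparkSession spark, Table table) {
        RewriteDataFiles.Result result =
            SparkActions.get(spark)
                .rewriteDataFiles(table)
                .option("partial-progress.enabled", "true")   // RewriteDataFiles.PARTIAL_PROGRESS_ENABLED
                .option("partial-progress.max-commits", "10") // RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS
                .execute();

        System.out.println("Rewritten data files: " + result.rewrittenDataFilesCount());
      }
    }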
