I'd think chunking the work as much as possible and disabling metrics for columns where they're not helpful goes a long way, but it may be insufficient for extreme cases. I've also been thinking about whether there are more space-efficient data structures for maintaining file paths that exploit the fact that the files share a common location prefix. Specifically, I was thinking of radix trees (compressed tries): https://en.wikipedia.org/wiki/Radix_tree.
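To make that concrete, here is a minimal sketch of a radix tree over plain strings (this is not the implementation in the PR referenced below; there is no deletion, no concurrency, and no DataFile metadata). It just illustrates how a shared path prefix ends up stored once:

import java.util.HashMap;
import java.util.Map;

/**
 * Minimal radix tree (compressed trie) over strings. A shared prefix such as
 * "s3://<warehouse>/<db>/<table>/data/" is stored once on an edge label instead
 * of once per entry. Supports add() and contains() only; sketch, not the PR.
 */
public class RadixStringSet {

  private static final class Node {
    final Map<Character, Edge> edges = new HashMap<>();
    boolean terminal; // true if a complete key ends at this node
  }

  private static final class Edge {
    String label; // the compressed run of characters on this edge
    Node child;

    Edge(String label, Node child) {
      this.label = label;
      this.child = child;
    }
  }

  private final Node root = new Node();

  public void add(String key) {
    Node node = root;
    int i = 0;
    while (i < key.length()) {
      Edge edge = node.edges.get(key.charAt(i));
      if (edge == null) {
        // No edge shares the next character: store the whole remaining suffix once.
        Node leaf = new Node();
        leaf.terminal = true;
        node.edges.put(key.charAt(i), new Edge(key.substring(i), leaf));
        return;
      }
      int common = commonPrefixLength(key, i, edge.label);
      if (common < edge.label.length()) {
        // Key diverges inside the edge label: split the edge at the divergence point.
        Node split = new Node();
        Edge lower = new Edge(edge.label.substring(common), edge.child);
        split.edges.put(lower.label.charAt(0), lower);
        edge.label = edge.label.substring(0, common);
        edge.child = split;
      }
      node = edge.child;
      i += common;
    }
    node.terminal = true;
  }

  public boolean contains(String key) {
    Node node = root;
    int i = 0;
    while (i < key.length()) {
      Edge edge = node.edges.get(key.charAt(i));
      if (edge == null || !key.startsWith(edge.label, i)) {
        return false;
      }
      i += edge.label.length();
      node = edge.child;
    }
    return node.terminal;
  }

  private static int commonPrefixLength(String key, int offset, String label) {
    int max = Math.min(key.length() - offset, label.length());
    int n = 0;
    while (n < max && key.charAt(offset + n) == label.charAt(n)) {
      n++;
    }
    return n;
  }
}

With a million keys under the same table location, the location prefix is paid for once on a single edge rather than once per key, while contains() still walks at most the length of the key.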
For example, if the file paths are all "s3://<some-warehouse-uuid>/<some-database-uuid>/<some-table-uuid>/data/<partition>/file.parquet", a normal HashSet<DataFile> over a very large set will hold many files, most of which repeat the same prefix. With a radix tree, the file paths should (in theory) consume significantly less memory, because there would be a single representation of "s3://<some-warehouse-uuid>/<some-database-uuid>/<some-table-uuid>/data/" instead of a million copies in the case of a million data files. In Iceberg we probably wouldn't be leveraging the efficient prefix lookups this data structure provides, since we don't really need that operation, but its key lookups should be as good as a hash set's, with the additional benefit of reduced memory consumption from exploiting the structure of these file paths. I've played around with this idea in the remove orphan files procedure (https://github.com/apache/iceberg/pull/10229), but I still need to collect data points on the benefits. I also plan on writing a benchmark which generates a bunch of these files and uses instrumentation to measure the memory consumption.

Thanks,
Amogh Jahagirdar

On Wed, May 22, 2024 at 1:15 AM Naveen Kumar <nk1...@gmail.com> wrote:

> Hi Szehon,
>
> Thanks for your email.
>
> I agree that configuring metadata metrics per column will create smaller
> manifest files, with fewer lower and upper bounds per content entry. Assuming
> your patch <https://github.com/apache/iceberg/pull/2608> is merged, it would
> work as follows:
>
> 1. A user identifies all the columns on which pruning is not needed.
> 2. Updates the table properties and disables metrics on those columns.
> 3. Runs repair manifests <https://github.com/apache/iceberg/pull/2608>.
> 4. Runs compaction.
>
> However, this still does not solve the original problem where *Set<DataFile>*
> and *Set<DeleteFile>* have grown to a significantly large size (say 1M). This
> might be very rare, but I have seen cases where a user never ran compaction
> and, after adding a new partition column, is trying to compact the entire
> table.
>
> Is this a valid use case? WDYT?
>
> Regards,
> Naveen Kumar
>
>
> On Tue, May 21, 2024 at 10:47 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi Naveen
>>
>> Yes, it sounds like it will help to disable metrics for those columns.
>> IIRC, by default manifest entries have metrics at the 'truncate(16)' level
>> for 100 columns, which, as you see, can be quite memory intensive. A
>> potential later improvement is the ability to remove counts by config as
>> well, though we need to confirm whether that is feasible.
>>
>> Unfortunately, today a new metrics config will only apply to new data
>> files (you have to rewrite them all, or otherwise phase old data files out).
>> I had a patch a while back to add support for rewriting just the manifests
>> with a new metrics config, but it was not merged yet; if any reviewer has
>> time to review it, I can work on it again.
>> https://github.com/apache/iceberg/pull/2608
>>
>> Thanks
>> Szehon
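As a rough illustration of the per-column metrics configuration discussed above (a sketch only: the property keys are Iceberg's write.metadata.metrics.* table properties, but the column names and the chosen modes here are made-up examples):

import org.apache.iceberg.Table;

public class MetricsTuningExample {

  /**
   * Keep bounds only on the columns actually used for pruning and drop them
   * elsewhere. Per Szehon's note, this only affects newly written data files.
   * Column names below are hypothetical.
   */
  static void trimColumnMetrics(Table table) {
    table
        .updateProperties()
        .set("write.metadata.metrics.default", "counts") // no bounds by default
        .set("write.metadata.metrics.column.event_time", "truncate(16)") // pruning column: truncated bounds
        .set("write.metadata.metrics.column.description", "none") // wide string column: no metrics at all
        .commit();
  }
}

Until the manifests and existing data files are rewritten (for example via the rewrite-manifests patch above), the old entries keep their original metrics.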
>> On Tue, May 21, 2024 at 1:43 AM Naveen Kumar <nk1...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I am looking into the RewriteFiles
>>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/RewriteFiles.java>
>>> API and its implementation BaseRewriteFiles
>>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java>.
>>> Currently this works as follows:
>>>
>>> 1. It accumulates all the files for addition and deletion.
>>> 2. At commit time, it creates a new snapshot after adding all the
>>> entries to the corresponding manifest files.
>>>
>>> It has been observed that if the accumulated file objects are very
>>> large, this consumes a lot of memory.
>>> *e.g.*: each DataFile object is about *1 KB* and the total number of
>>> accumulated files (additions or deletions) is *1 million*, so the total
>>> memory consumed by *RewriteFiles* will be around *1 GB*.
>>>
>>> Such a dataset can arise for the following reasons:
>>>
>>> 1. The table is very wide, with say 1000 columns.
>>> 2. Most of the columns are of String data type, which takes more
>>> space to store lower and upper bounds.
>>> 3. The table has billions of records across millions of data files.
>>> 4. A data compaction procedure/job is running for the first time.
>>> 5. Or, the table was un-partitioned and later evolved to add new
>>> partition columns.
>>> 6. Now it is trying to compact the table.
>>>
>>> Attaching a heap dump from one of these datasets while using the API:
>>>
>>>> RewriteFiles rewriteFiles(
>>>>     Set<DataFile> removedDataFiles,
>>>>     Set<DeleteFile> removedDeleteFiles,
>>>>     Set<DataFile> addedDataFiles,
>>>>     Set<DeleteFile> addedDeleteFiles)
>>>
>>> [image: Screenshot 2024-01-11 at 10.01.54 PM.png]
>>>
>>> We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT
>>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L45C11-L45C43>,
>>> which helps create smaller groups and multiple commits together with the
>>> configuration PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT
>>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>.
>>> Currently, engines like Spark can follow this strategy (illustrated in the
>>> sketch at the end of this thread). Since Spark runs all the compaction jobs
>>> concurrently, there is a chance that many jobs land on the same machine and
>>> accumulate high memory usage.
>>>
>>> My question is: can we make these implementations
>>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>
>>> better to avoid heap pressure? Also, has someone encountered similar
>>> issues, and if so, how did they fix them?
>>>
>>> Regards,
>>> Naveen Kumar
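For reference, the partial-progress knobs Naveen mentions can be set on the Spark rewrite action roughly as follows (a sketch: the option constants come from the RewriteDataFiles interface linked above, while the method shape and the chosen values are illustrative):

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class PartialProgressCompaction {

  /** Compact in smaller groups and commit incrementally instead of one huge commit. */
  static RewriteDataFiles.Result compact(SparkSession spark, Table table) {
    return SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true") // commit file groups as they finish
        .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10") // cap the number of commits
        .option(RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, // illustrative 10 GB cap per file group
            String.valueOf(10L * 1024 * 1024 * 1024))
        .execute();
  }
}

This is the strategy Naveen refers to: smaller groups and multiple commits rather than one commit that accumulates every DataFile and DeleteFile at once.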