Hi Szehon,

Thanks for your email.

I agree that configuring metadata metrics per column will create a smaller
manifest file, since lower and upper bounds are stored per content entry
only for the configured columns. Assuming your patch
<https://github.com/apache/iceberg/pull/2608> is merged, it would work as
follows:

   1. A user identifies all the columns on which pruning is not needed.
   2. They update the table properties to disable metrics on those columns
   (see the sketch after this list).
   3. They run repair manifests <https://github.com/apache/iceberg/pull/2608>.
   4. They run compaction.
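
For step 2, a minimal sketch of what I mean, using the Java API. The column
names below are made up; the write.metadata.metrics.* keys are the standard
table properties for per-column metrics modes:

    import org.apache.iceberg.Table;

    // Hypothetical helper: drop bounds for columns we never prune on,
    // keep truncated bounds only where filters actually use them.
    static void trimColumnMetrics(Table table) {
      table
          .updateProperties()
          .set("write.metadata.metrics.column.payload_json", "none")
          .set("write.metadata.metrics.column.free_text_notes", "none")
          .set("write.metadata.metrics.column.event_time", "truncate(16)")
          .commit();
    }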

However, this still does not solve the original problem where *Set<DataFile>*
and *Set<DeleteFile>* have grown to a very large size (say 1M entries). This
might be rare, but I have seen cases where a user never ran compaction and,
after adding a new partition column, tried to compact the entire table in
one go.

Is this a valid use case? WDYT?

Regards,
Naveen Kumar



On Tue, May 21, 2024 at 10:47 PM Szehon Ho <[email protected]> wrote:

> Hi Naveen
>
> Yes, it sounds like it will help to disable metrics for those columns.
> IIRC, by default manifest entries have metrics at the 'truncate(16)' level
> for 100 columns, which as you see can be quite memory intensive.  A
> potential later improvement is to add the ability to remove counts by
> config, though we need to confirm whether that is feasible.
>
> Unfortunately, today the new metrics config will only apply to new data
> files (you have to rewrite them all, or otherwise phase old data files
> out).  I had a patch a while back to add support for rewriting just the
> manifests with the new metrics config, but it was not merged yet; if any
> reviewer has time to review, I can work on it again.
> https://github.com/apache/iceberg/pull/2608
>
> Thanks
> Szehon
>
> On Tue, May 21, 2024 at 1:43 AM Naveen Kumar <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I am looking into RewriteFiles
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/RewriteFiles.java>
>> APIs and its implementation BaseRewriteFiles
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java>.
>> Currently this works as follows:
>>
>>    1. It accumulates all the files for addition and deletion.
>>    2. At commit time, it creates a new snapshot after adding all the
>>    entries to the corresponding manifest files.
>>
>> It has been observed that if the number of accumulated file objects is
>> huge, this takes a lot of memory.
>> *e.g.*: each DataFile object is around *1 KB* in size, and the total
>> number of accumulated entries (additions plus deletions) is *1 million*,
>> so the total memory consumed by *RewriteFiles* will be around *1 GB*.
>>
>> Such a dataset can arise for the following reasons:
>>
>>    1. The table is very wide, with say 1000 columns.
>>    2. Most of the columns are of String data type, which takes more
>>    space to store lower and upper bounds.
>>    3. The table has billions of records spread across millions of data
>>    files.
>>    4. Data compaction procedures/jobs are being run for the first time.
>>    5. Or, the table was unpartitioned and later evolved to add new
>>    partition columns.
>>    6. Now the entire table is being compacted.
>>
>> Attaching a heap dump from one such dataset while using the API:
>>
>>> RewriteFiles rewriteFiles(
>>>     Set<DataFile> removedDataFiles,
>>>     Set<DeleteFile> removedDeleteFiles,
>>>     Set<DataFile> addedDataFiles,
>>>     Set<DeleteFile> addedDeleteFiles)
>>>
>>>
>> [image: Screenshot 2024-01-11 at 10.01.54 PM.png]
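>>
>> For reference, a minimal sketch of how this API is driven (assuming the
>> four sets have already been built by the compaction job); all four sets
>> stay fully in memory until commit():
>>
>>     import java.util.Set;
>>     import org.apache.iceberg.DataFile;
>>     import org.apache.iceberg.DeleteFile;
>>     import org.apache.iceberg.RewriteFiles;
>>     import org.apache.iceberg.Table;
>>
>>     // Every DataFile/DeleteFile (including its column bounds) is held
>>     // on the heap until the snapshot is committed.
>>     static void commitRewrite(
>>         Table table,
>>         Set<DataFile> removedDataFiles,
>>         Set<DeleteFile> removedDeleteFiles,
>>         Set<DataFile> addedDataFiles,
>>         Set<DeleteFile> addedDeleteFiles) {
>>       RewriteFiles rewrite = table.newRewrite();
>>       rewrite.rewriteFiles(
>>           removedDataFiles, removedDeleteFiles,
>>           addedDataFiles, addedDeleteFiles);
>>       rewrite.commit();
>>     }
>>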
>> We do have properties like PARTIAL_PROGRESS_ENABLED_DEFAULT
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L45C11-L45C43>,
>> which help create smaller file groups and multiple commits, together with
>> PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>.
>> Currently, engines like Spark can follow this strategy. But since Spark
>> runs all the compaction jobs concurrently, there is a chance that many
>> jobs land on the same machine and accumulate high memory usage.
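>>
>> For reference, a minimal sketch of enabling partial progress from the
>> Spark action API (assuming an active Spark session; the max-commits value
>> of 10 is arbitrary):
>>
>>     import org.apache.iceberg.Table;
>>     import org.apache.iceberg.actions.RewriteDataFiles;
>>     import org.apache.iceberg.spark.actions.SparkActions;
>>
>>     // Partial progress commits the rewrite in smaller batches instead
>>     // of one giant Set<DataFile>/Set<DeleteFile> commit at the end.
>>     static void compactWithPartialProgress(Table table) {
>>       SparkActions.get()
>>           .rewriteDataFiles(table)
>>           .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
>>           .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10")
>>           .execute();
>>     }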
>>
>> My question is: can we improve these implementations
>> <https://github.com/apache/iceberg/blob/8d6bee736884575da7368e0963268d1cbe362d90/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L53C7-L53C43>
>> to avoid this heap pressure? Also, has anyone encountered similar issues,
>> and if so, how did they fix them?
>>
>> Regards,
>> Naveen Kumar
>>
>>
