I would much rather we switch to the "everything is a manifest" approach.
Instead of manifest lists we only ever have manifests. A Manifest can then
link to data files or additional manifests. In the streaming case, you
then only ever have to read and write a single manifest. If we couple this
with delete vectors we can greatly reduce the number of writes. I am
generally against anything that puts additional (unbounded) content into
the metadata.json. I'm not sure if anyone has written this up as a full
proposal yet but I know it's been discussed a bunch.
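
To make the idea a bit more concrete, here is a rough sketch in Python. It
is only an illustration, not a spec: the names Manifest, DataFile and
all_data_files are made up for this example and do not correspond to
anything in the Iceberg spec or libraries, and it ignores delete files,
partition summaries and sequence numbers entirely.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class DataFile:
    # Hypothetical stand-in for a data file entry.
    path: str


@dataclass
class Manifest:
    # Hypothetical: each entry is either a data file or a child manifest,
    # so there is no separate "manifest list" type.
    entries: List[Union[DataFile, "Manifest"]] = field(default_factory=list)


def all_data_files(manifest: Manifest) -> Iterator[DataFile]:
    """Walk the manifest tree depth-first and yield every data file."""
    for entry in manifest.entries:
        if isinstance(entry, Manifest):
            yield from all_data_files(entry)
        else:
            yield entry


# Existing table state: a root manifest that links to one older manifest.
older = Manifest([DataFile("a.parquet"), DataFile("b.parquet")])
root = Manifest([older])

# Streaming append: only the root manifest is rewritten with one more entry;
# the older manifest is referenced as-is and never touched.
root = Manifest(root.entries + [DataFile("c.parquet")])
```

The point of the sketch is that an append reads and writes one file (the
root), while older manifests are only linked to.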

On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hi Jan,
>
> Thanks for sending out this proposal. While reading through it, two
> questions pop up:
>
>    - You mentioned repurposing the manifests field. Currently, this field
>    contains a list of paths that point to the manifest data. Would this
>    also be your suggestion? This way, when committing the accumulated
>    manifests into a manifest list, you would need to open up all the manifests
>    to get information like partition information, statistics, etc. This way
>    there is also no way to distinguish between data and delete manifests
>    without having to open the files, effectively always including those files
>    in the query plan.
>    - It is unclear to me if appending a manifest to the manifests will
>    create a new snapshot. I think it should. Either way, I think this
>    conflicts with the concept of sequence number inheritance
>    <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
>    This is used to avoid having to rewrite a manifest when a conflict occurs;
>    you only have to rewrite the manifest list. When there is a conflict, the
>    client that sees the conflict will take the latest manifest-list and
>    inherit the sequence number. When you can append to the manifest list,
>    you won't be able to determine which snapshot has added the file. If you
>    didn't use inheritance, then you would need to rewrite the manifest on a
>    conflict (because the sequence ID has been used already).
>
> I have to think a bit more about it but above are my concerns so far.
>
> Kind regards,
> Fokko
>
> On Fri, Nov 22, 2024 at 15:26, Jan Kaul <jank...@mailbox.org.invalid> wrote:
>
>> Hi all,
>>
>> I'd like to propose an optimization for how we track manifest files in
>> Iceberg tables, specifically focusing on reducing write amplification and
>> simplifying conflict resolution during fast-append operations.
>>
>> Background: Replace vs. Change-Based Updates
>>
>> To frame this proposal, let's first consider two approaches to state
>> management in table systems:
>>
>> 1. Replace-based updates: The entire state is replaced with each update.
>> This is how Iceberg currently handles manifest files - when new manifests
>> are added, we create an entirely new snapshot.
>>
>> 2. Change-based updates: Only incremental changes are tracked and
>> replayed to derive the current state. This is similar to how Delta tables
>> track data files.
>>
>> While Iceberg initially used purely replace-based updates, we've already
>> successfully adopted change-based updates for the top-level table metadata
>> with the REST catalog. Instead of uploading entire table metadata, we now
>> only upload new snapshots during update-table operations.
>>
>> Proposed Enhancement
>>
>> I propose extending this change-based approach to manifest file tracking,
>> specifically for fast-append operations. Here's how:
>>
>> 1. Repurpose the manifests field as a buffer to track new manifest file
>> additions
>> 2. Define the complete set of manifest files as the union of:
>>    - Manifest files from the manifest-list
>>    - Manifest files from the manifests field
>>
>> Implementation Details
>>
>> - When performing fast-append operations:
>>   * New manifest files are added to the manifests field
>>   * Changes are committed via update-table catalog operation
>>   * The manifest-list remains unchanged, eliminating write amplification
>>
>> - After a configured number of fast-appends:
>>   * Manifest files are removed from the manifests field
>>   * Files are consolidated into a new manifest-list
>>   * The manifest files are assigned the sequence-number of the snapshot
>> when they are written to the manifest-list
>>
>> Constraints and Considerations
>>
>> For this approach to work effectively, manifest files in the manifests
>> field must:
>>    * Contain only data files that are not referenced by other manifests
>>    * Contain only delete files that reference data files already present
>> in the manifest-list
>>
>> If any of these assumptions is violated, the manifest files from the
>> manifests field are flushed to the manifest-list and the standard commit
>> procedure is applied.
>>
>> Benefits
>>
>> - Significantly reduced write amplification for streaming inserts
>> - Simplifies conflict resolution by the catalog. If two concurrent writes
>> occur, the entries in the manifests field can simply be merged together
>> - Leverages existing Iceberg metadata constructs
>> - Maintains compatibility with current catalog operations
>>
>> Note: While this proposal suggests repurposing the manifests field, we
>> could alternatively implement this as a new metadata field if preferred.
>>
>> I'd appreciate your thoughts on this approach and welcome any feedback or
>> concerns.
>>
>> Best regards,
>>
>> Jan
>>
>>
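
For anyone following along, this is how I read the buffering scheme in
Jan's proposal quoted above, as a small Python sketch. It is an
assumption-laden illustration rather than a proposed implementation:
TableState, fast_append, flush and flush_threshold are hypothetical names,
manifests are represented by bare path strings, and snapshots, sequence
number assignment and the catalog commit are all left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableState:
    # Manifests already consolidated into the manifest-list.
    manifest_list: List[str] = field(default_factory=list)
    # Buffer of manifests added by fast-appends (the "manifests" field).
    manifests: List[str] = field(default_factory=list)
    # Hypothetical stand-in for the "configured number of fast-appends".
    flush_threshold: int = 3

    def all_manifests(self) -> List[str]:
        """Complete set: manifest-list entries plus buffered entries."""
        return self.manifest_list + self.manifests

    def fast_append(self, manifest_path: str) -> None:
        """Buffer a new manifest; the manifest-list is not rewritten."""
        self.manifests.append(manifest_path)
        if len(self.manifests) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Consolidate buffered manifests into a new manifest-list."""
        self.manifest_list = self.manifest_list + self.manifests
        self.manifests = []


state = TableState()
state.fast_append("m1.avro")
state.fast_append("m2.avro")   # still buffered, manifest_list is empty
state.fast_append("m3.avro")   # threshold reached, buffer is flushed
```

Note that the buffer lives in the table metadata between flushes, which
is the part my reply at the top is concerned about.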
