Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

Jan Kaul Fri, 22 Nov 2024 09:40:23 -0800

Thanks for your feedback.

About your concerns Fokko:

1. Generally the number of manifest files in the manifests field shouldn't get too large. But I think you can already improve the write amplification and conflict resolution by using up to 10 manifest files. The fact that the manifests field only contains paths is not ideal and may be a reason to have a separate discussion on a new metadata field.

However, the writer writing the manifest files could keep some kind of cache of the partition values and statistics so that it doesn't need to fetch the information when writing the manifest-list. This becomes an issue when multiple concurrent writers are at work, because they would still need to fetch the information from the files that they didn't write.

As you mentioned, my approach would be to always include the manifest files from the manifests field in the query plan and only prune their manifest_entries. I would try to keep the number of manifest files in the manifests field small to reduce this effect, but this could definitely be a drawback.
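
To make the caching idea in point 1 concrete, here is a minimal sketch in Python. It is not taken from any Iceberg implementation: ManifestSummary, ManifestSummaryCache, summaries_for_flush and the fetch callback are hypothetical names chosen for illustration. The sketch only shows that a writer can skip reads for manifests it wrote itself and still has to fetch summaries for manifests buffered by other writers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical subset of what a manifest-list entry needs."""
    path: str
    content: str                        # "data" or "deletes"
    partition_bounds: Tuple[str, str]   # illustrative (lower, upper) bounds


class ManifestSummaryCache:
    """Writer-local cache keyed by manifest path (assumed design)."""

    def __init__(self, fetch: Callable[[str], ManifestSummary]) -> None:
        # fetch stands in for opening a manifest file in object storage.
        self._fetch = fetch
        self._entries: Dict[str, ManifestSummary] = {}
        self.fetch_count = 0

    def record(self, summary: ManifestSummary) -> None:
        # Called by the writer right after it has written a manifest.
        self._entries[summary.path] = summary

    def get(self, path: str) -> ManifestSummary:
        # Manifests written by other writers are not cached and must be read.
        if path not in self._entries:
            self.fetch_count += 1
            self._entries[path] = self._fetch(path)
        return self._entries[path]


def summaries_for_flush(cache: ManifestSummaryCache,
                        buffered_paths: List[str]) -> List[ManifestSummary]:
    # Collect the summaries needed to write the new manifest-list.
    return [cache.get(path) for path in buffered_paths]
```

With a single writer, fetch_count stays at zero during a flush. With concurrent writers it grows with the number of buffered manifests that came from someone else, which is the drawback described above.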

2. Regarding the sequence-number inheritance, every manifest file in the manifests field should inherit the sequence-number from the snapshot that contains it. This means that all manifest files in the manifests field have the same sequence-number, which limits the capabilities of deletes. One could either limit deletes to only reference data files that are already committed to the manifest-list, or one might flush the manifest files from the manifests field every time a delete file occurs, essentially disabling the proposed behavior. It would still yield benefits for append-only tables.

The conflict resolution should be easier for most scenarios as the manifest-list does not need to be rewritten. For appends the new manifests field is the union of the manifest files of the conflicting manifests fields.
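
As a rough illustration of point 2, the union-based conflict resolution and the flush rule for deletes could be sketched as follows. This is Python with hypothetical function names (merge_manifest_buffers, needs_flush), not an existing Iceberg or catalog API, and it assumes the manifests field is a plain list of paths.

```python
from typing import Iterable, List


def merge_manifest_buffers(theirs: Iterable[str],
                           ours: Iterable[str]) -> List[str]:
    """Resolve an append/append conflict as the union of both buffers.

    dict.fromkeys removes duplicates while keeping first-seen order, so
    the already committed entries ("theirs") come first.
    """
    return list(dict.fromkeys([*theirs, *ours]))


def needs_flush(content: str, references_only_committed_files: bool) -> bool:
    """Decide whether the buffer must be flushed to the manifest-list.

    Assumed rule: data manifests can always be buffered. A delete manifest
    can be buffered only if every data file it references is already in
    the manifest-list, because all buffered manifests share one
    sequence-number.
    """
    return content == "deletes" and not references_only_committed_files
```

For example, merging ["m1.avro", "m2.avro"] with ["m2.avro", "m3.avro"] gives ["m1.avro", "m2.avro", "m3.avro"], and no manifest-list has to be rewritten to get there.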


About your concerns Russell:

My motivation was to have a separation between a consolidated and a temporary list of manifest files. The contents of the temporary list regularly get moved to the consolidated list. But the fact that the temporary list is small reduces the impact of frequent rewrites and makes it easy to use set operations to resolve conflicts. These different lists could be stored as two different manifest files that contain other manifests or data files. For now I wanted to reuse the existing manifest-list and manifests fields.
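
A small sketch of this consolidated/temporary split, again in Python with hypothetical names (TrackedManifests, max_buffered). The threshold of 10 is only the figure I mentioned above, not a spec value, and the rewrites counter stands in for writing a new manifest-list file.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrackedManifests:
    """Consolidated list (manifest-list) plus a small temporary buffer."""
    consolidated: List[str] = field(default_factory=list)
    buffer: List[str] = field(default_factory=list)
    max_buffered: int = 10   # assumed "configured number of fast-appends"
    rewrites: int = 0        # how often the consolidated list was rewritten

    def all_manifests(self) -> List[str]:
        # The complete set is the union of both lists.
        return self.consolidated + self.buffer

    def fast_append(self, manifest_path: str) -> None:
        # Only the small buffer changes on a fast-append.
        self.buffer.append(manifest_path)
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self) -> None:
        # Move the buffered manifests into the consolidated list.
        if not self.buffer:
            return
        self.consolidated = self.consolidated + self.buffer
        self.buffer = []
        self.rewrites += 1
```

With max_buffered=3, seven fast-appends cause two rewrites of the consolidated list instead of seven, and one manifest is left in the buffer.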


Thanks,

Jan

On 22.11.24 17:02, Russell Spitzer wrote:

I would much rather we switch to the "everything is a manifest" approach. Instead of manifest lists we only ever have manifests. A manifest can then link to data files or additional manifests. In the case of streaming you then only ever have to read and write a single manifest. If we couple this with delete vectors we can greatly reduce the number of writes. I am generally against anything that puts additional (unbounded) content into the metadata.json. I'm not sure if anyone has written this up as a full proposal yet but I know it's been discussed a bunch.


On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org> wrote:

    Hi Jan,

    Thanks for sending out this proposal. While reading through it,
    two questions pop up:

      * You mentioned repurposing the manifests field. Currently, this
        field contains a list of paths that point to the manifest
        data. Would this also be your suggestion? This way, when
        committing the accumulated manifests into a manifest list, you
        would need to open up all the manifests to get information
        like partition information, statistics, etc. This way there is
        also no way to distinguish between data and delete manifests
        without having to open the files, effectively always
        including those files in the query plan.
      * It is unclear to me if appending a manifest to the manifests
        will create a new snapshot. I think it should. Either way, I
        think this conflicts with the concept of sequence number
        inheritance
        <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
        This is used to avoid having to rewrite a manifest when a
        conflict occurs, you only have to rewrite the manifest list.
        When there is a conflict, the client that sees the conflict
        will take the latest manifest-list and inherit the
        sequence number. When you can append to the manifest list, you
        won't be able to determine which snapshot has added the file.
        If you wouldn't use inheritance, then you would need to
        rewrite the manifest on a conflict (because the sequence ID
        has been used already).

    I have to think a bit more about it but above are my concerns so far.

    Kind regards,
    Fokko

    On Fri, 22 Nov 2024 at 15:26, Jan Kaul
    <jank...@mailbox.org.invalid> wrote:

        Hi all,

        I'd like to propose an optimization for how we track manifest
        files in Iceberg tables, specifically focusing on reducing
        write amplification and simplifying conflict resolution during
        fast-append operations.


                Background: Replace vs. Change-Based Updates

        To frame this proposal, let's first consider two approaches to
        state management in table systems:

        1. Replace-based updates: The entire state is replaced with
        each update. This is how Iceberg currently handles manifest
        files - when new manifests are added, we create an entirely
        new snapshot.

        2. Change-based updates: Only incremental changes are tracked
        and replayed to derive the current state. This is similar to
        how Delta tables track data files.

        While Iceberg initially used purely replace-based updates,
        we've already successfully adopted change-based updates for
        the top-level table metadata with the REST catalog. Instead of
        uploading entire table metadata, we now only upload new
        snapshots during update-table operations.


                Proposed Enhancement

        I propose extending this change-based approach to manifest
        file tracking, specifically for fast-append operations. Here's
        how:

        1. Repurpose the manifests field as a buffer to track new
        manifest file additions
        2. Define the complete set of manifest files as the union of:
           - Manifest files from the manifest-list
           - Manifest files from the manifests field


                Implementation Details

        - When performing fast-append operations:
          * New manifest files are added to the manifests field
          * Changes are committed via update-table catalog operation
          * The manifest-list remains unchanged, eliminating write
        amplification

        - After a configured number of fast-appends:
          * Manifest files are removed from the manifests field
          * Files are consolidated into a new manifest-list
          * The manifest files are assigned the sequence-number of the
        snapshot when they are written to the manifest-list


                Constraints and Considerations

        For this approach to work effectively, manifest files in the
        manifests field must:
           * Contain only data files that are not referenced by other
        manifests
           * Contain only delete files that reference data files
        already present in the manifest-list

        If any of these assumptions is violated, the manifest files
        from the manifests field are flushed to the manifest-list and
        the standard commit procedure is applied.


                Benefits

        - Significantly reduced write amplification for streaming inserts
        - Simplifies conflict resolution by the catalog. If two
        concurrent writes occur, the entries in the manifests field
        can simply be merged together
        - Leverages existing Iceberg metadata constructs
        - Maintains compatibility with current catalog operations

        Note: While this proposal suggests repurposing the manifests
        field, we could alternatively implement this as a new metadata
        field if preferred.

        I'd appreciate your thoughts on this approach and welcome any
        feedback or concerns.

        Best regards,

        Jan
