Thanks for your feedback.

Regarding your concerns, Fokko:

1. Generally, the number of manifest files in the manifests field shouldn't get too large. I think you can already improve write amplification and conflict resolution by using up to 10 manifest files. The fact that the manifests field only contains paths is not ideal and may be a reason to have a separate discussion on a new metadata field. However, the writer writing the manifest files could keep some kind of cache of the partition values and statistics so that it doesn't need to fetch that information when writing the manifest-list. This breaks down when multiple concurrent writers are at work, because each writer would still need to fetch the information from the manifest files that it didn't write. As you mentioned, my approach would be to always include the manifest files from the manifests field in the query plan and only prune their manifest_entries. I would try to keep the number of manifest files in the manifests field small to reduce this effect, but this could definitely be a drawback.

2. Regarding sequence-number inheritance, every manifest file in the manifests field should inherit the sequence-number from the snapshot that contains it. This means that all manifest files in the manifests field have the same sequence-number, which limits the capabilities of deletes. One could either limit deletes to only reference data files that are already committed to the manifest-list, or flush the manifest files from the manifests field every time a delete file occurs, essentially disabling the proposed behavior. It would still yield benefits for append-only tables. Conflict resolution should be easier for most scenarios, as the manifest-list does not need to be rewritten. For appends, the new manifests field is the union of the manifest files in the conflicting manifests fields.
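
The append case in point 2 can be sketched roughly as follows. This is only a minimal Python illustration, assuming each entry in the manifests field is a plain path string; the function name `merge_manifests_fields` and the file names are hypothetical and not part of any Iceberg API.

```python
from typing import Iterable, List


def merge_manifests_fields(committed: Iterable[str], pending: Iterable[str]) -> List[str]:
    """Return the union of two conflicting manifests fields.

    Entries already committed by the other writer come first, followed by
    the retrying writer's entries. Duplicates (for example the shared base
    entries) are kept only once.
    """
    merged: List[str] = []
    seen = set()
    for path in list(committed) + list(pending):
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged


# Two writers start from the same base and each append one manifest.
base = ["m1.avro"]
writer_a = base + ["m2.avro"]  # commits first
writer_b = base + ["m3.avro"]  # sees the conflict and merges
resolved = merge_manifests_fields(writer_a, writer_b)
# resolved == ["m1.avro", "m2.avro", "m3.avro"]
```

Because the merge is a plain set union, it does not matter which writer commits first, apart from the order of the entries.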

Regarding your concerns, Russell:

My motivation was to have a separation between a consolidated and a temporary list of manifest files. The contents of the temporary list are regularly moved to the consolidated list. Because the temporary list is small, frequent rewrites have little impact, and conflicts are easy to resolve with set operations. These different lists could be stored as two different manifest files that contain other manifests or data files. For now I wanted to reuse the existing manifest-list and manifests fields.
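
As a rough sketch of that separation, assuming manifests are identified by path strings and using the figure of 10 from point 1 as the buffer limit: the class `ManifestTracker`, its fields, and `MAX_BUFFERED` are hypothetical names chosen for illustration, not existing Iceberg constructs.

```python
from dataclasses import dataclass, field
from typing import List

MAX_BUFFERED = 10  # assumed limit for the temporary list


@dataclass
class ManifestTracker:
    consolidated: List[str] = field(default_factory=list)  # stands in for the manifest-list
    temporary: List[str] = field(default_factory=list)     # stands in for the manifests field
    rewrites: int = 0  # number of times the consolidated list was rewritten

    def fast_append(self, manifest_path: str) -> None:
        """Add a manifest to the temporary list, flushing first if it is full."""
        if len(self.temporary) >= MAX_BUFFERED:
            self.flush()
        self.temporary.append(manifest_path)

    def flush(self) -> None:
        """Move the temporary entries into a newly written consolidated list."""
        if not self.temporary:
            return
        self.consolidated = self.consolidated + self.temporary
        self.temporary = []
        self.rewrites += 1

    def all_manifests(self) -> List[str]:
        """The complete set of manifests is the union of both lists."""
        return self.consolidated + self.temporary


tracker = ManifestTracker()
for i in range(25):
    tracker.fast_append(f"m{i}.avro")
# 25 appends cause 2 rewrites of the consolidated list instead of 25.
```

In this sketch the consolidated list is rewritten once per full buffer rather than once per append, which is where the reduction in write amplification would come from.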

Thanks,

Jan

On 22.11.24 17:02, Russell Spitzer wrote:
I would much rather we switch to the "everything is a manifest" approach. Instead of manifest lists we only ever have manifests. A manifest can then link to data files or additional manifests. In the case of streaming, you then only ever have to read and write a single manifest. If we couple this with delete vectors we can greatly reduce the number of writes. I am generally against anything that puts additional (unbounded) content into the metadata.json. I'm not sure if anyone has written this up as a full proposal yet, but I know it's been discussed a bunch.

On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org> wrote:

    Hi Jan,

    Thanks for sending out this proposal. While reading through it,
    two questions pop up:

      * You mentioned repurposing the manifests field. Currently, this
        field contains a list of paths that point to the manifest
        data. Would this also be your suggestion? This way, when
        committing the accumulated manifests into a manifest list, you
        would need to open up all the manifests to get information
        like partition information, statistics, etc. This way there is
        also no way to distinguish between data and delete manifests
        without having to open the files, effectively always
        including those files in the query plan.
      * It is unclear to me if appending a manifest to the manifests
        will create a new snapshot. I think it should. Either way, I
        think this conflicts with the concept of sequence number
        inheritance
        <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
        This is used to avoid having to rewrite a manifest when a
        conflict occurs, you only have to rewrite the manifest list.
        When there is a conflict, the client that sees the conflict,
        will take the latest manifest-list, and inherit the
        sequence number. When you can append to the manifest list, you
        won't be able to determine which snapshot has added the file.
        If you wouldn't use inheritance, then you would need to
        rewrite the manifest on a conflict (because the sequence ID
        has been used already).

    I have to think a bit more about it but above are my concerns so far.

    Kind regards,
    Fokko

    Op vr 22 nov 2024 om 15:26 schreef Jan Kaul
    <jank...@mailbox.org.invalid>:

        Hi all,

        I'd like to propose an optimization for how we track manifest
        files in Iceberg tables, specifically focusing on reducing
        write amplification and simplifying conflict resolution during
        fast-append operations.


                Background: Replace vs. Change-Based Updates

        To frame this proposal, let's first consider two approaches to
        state management in table systems:

        1. Replace-based updates: The entire state is replaced with
        each update. This is how Iceberg currently handles manifest
        files - when new manifests are added, we create an entirely
        new snapshot.

        2. Change-based updates: Only incremental changes are tracked
        and replayed to derive the current state. This is similar to
        how Delta tables track data files.

        While Iceberg initially used purely replace-based updates,
        we've already successfully adopted change-based updates for
        the top-level table metadata with the REST catalog. Instead of
        uploading entire table metadata, we now only upload new
        snapshots during update-table operations.


                Proposed Enhancement

        I propose extending this change-based approach to manifest
        file tracking, specifically for fast-append operations. Here's
        how:

        1. Repurpose the manifests field as a buffer to track new
        manifest file additions
        2. Define the complete set of manifest files as the union of:
           - Manifest files from the manifest-list
           - Manifest files from the manifests field


                Implementation Details

        - When performing fast-append operations:
          * New manifest files are added to the manifests field
          * Changes are committed via update-table catalog operation
          * The manifest-list remains unchanged, eliminating write
            amplification

        - After a configured number of fast-appends:
          * Manifest files are removed from the manifests field
          * Files are consolidated into a new manifest-list
          * The manifest files are assigned the sequence-number of the
            snapshot when they are written to the manifest-list


                Constraints and Considerations

        For this approach to work effectively, manifest files in the
        manifests field must:
           * Contain only data files that are not referenced by other
             manifests
           * Contain only delete files that reference data files
             already present in the manifest-list

        If any of these assumptions is violated, the manifest files
        from the manifests field are flushed to the manifest-list and
        the standard commit procedure is applied.


                Benefits

        - Significantly reduced write amplification for streaming inserts
        - Simplifies conflict resolution by the catalog. If two
        concurrent writes occur, the entries in the manifests field
        can simply be merged together
        - Leverages existing Iceberg metadata constructs
        - Maintains compatibility with current catalog operations

        Note: While this proposal suggests repurposing the manifests
        field, we could alternatively implement this as a new metadata
        field if preferred.

        I'd appreciate your thoughts on this approach and welcome any
        feedback or concerns.

        Best regards,

        Jan
