Hi Jan,

Thanks for sending out this proposal. While reading through it, two
questions came up:

   - You mentioned repurposing the manifests field. Currently, this field
   contains a list of paths that point to the manifest data. Would that
   also be your suggestion? If so, when committing the accumulated
   manifests into a manifest list, you would need to open all the manifests
   to get the partition information, statistics, etc. There would also be
   no way to distinguish between data and delete manifests without opening
   the files, so those files would effectively always be included in the
   query plan.
   - It is unclear to me whether appending a manifest to the manifests field
   will create a new snapshot. I think it should. Either way, I think this
   conflicts with the concept of sequence number inheritance
   <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
   Inheritance is used to avoid having to rewrite a manifest when a conflict
   occurs; you only have to rewrite the manifest list. When there is a
   conflict, the client that sees the conflict takes the latest manifest
   list and inherits the sequence number. When you can append to the
   manifest list, you won't be able to determine which snapshot added the
   file. If you don't use inheritance, then you need to rewrite the manifest
   on a conflict (because the sequence number has already been used).
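
To make the inheritance point concrete, here is a minimal Python sketch of
how I read the spec. The class and function names are made up for
illustration and are not Iceberg APIs; the only assumption taken from the
spec is that an added entry with a null sequence number inherits the one
recorded for its manifest in the manifest list.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ManifestEntry:
    file_path: str
    # None means "inherit from the manifest list" (hypothetical model).
    sequence_number: Optional[int] = None


@dataclass
class Manifest:
    path: str
    entries: List[ManifestEntry]


@dataclass
class ManifestListEntry:
    manifest: Manifest
    # Assigned when the manifest list is written, not when the manifest is.
    sequence_number: int


def write_manifest_list(
    new_manifests: List[Manifest],
    existing: List[ManifestListEntry],
    table_last_seq: int,
) -> Tuple[List[ManifestListEntry], int]:
    """Build the manifest list for one commit attempt."""
    seq = table_last_seq + 1
    return existing + [ManifestListEntry(m, seq) for m in new_manifests], seq


def effective_sequence_number(
    list_entry: ManifestListEntry, entry: ManifestEntry
) -> int:
    """Resolve an entry's sequence number, inheriting when it is unset."""
    if entry.sequence_number is not None:
        return entry.sequence_number
    return list_entry.sequence_number


# The manifest is written once, with no sequence number inside it.
manifest = Manifest("m1.avro", [ManifestEntry("data-1.parquet")])

# First attempt, based on a table whose last sequence number is 4.
first_try, seq1 = write_manifest_list([manifest], [], table_last_seq=4)

# Another writer commits first, so the retry only rewrites the manifest list.
retry, seq2 = write_manifest_list([manifest], [], table_last_seq=5)

print(effective_sequence_number(retry[-1], manifest.entries[0]))  # prints 6
```

The manifest itself is identical across both attempts; only the manifest
list entry changes. A manifest that is referenced from the manifests field
has no such entry to inherit from, which is the gap I am worried about.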

I have to think about it a bit more, but those are my concerns so far.

Kind regards,
Fokko

On Fri, 22 Nov 2024 at 15:26, Jan Kaul <jank...@mailbox.org.invalid> wrote:

> Hi all,
>
> I'd like to propose an optimization for how we track manifest files in
> Iceberg tables, specifically focusing on reducing write amplification and
> simplifying conflict resolution during fast-append operations.
>
> Background: Replace vs. Change-Based Updates
>
> To frame this proposal, let's first consider two approaches to state
> management in table systems:
>
> 1. Replace-based updates: The entire state is replaced with each update.
> This is how Iceberg currently handles manifest files - when new manifests
> are added, we create an entirely new snapshot.
>
> 2. Change-based updates: Only incremental changes are tracked and replayed
> to derive the current state. This is similar to how Delta tables track data
> files.
>
> While Iceberg initially used purely replace-based updates, we've already
> successfully adopted change-based updates for the top-level table metadata
> with the REST catalog. Instead of uploading entire table metadata, we now
> only upload new snapshots during update-table operations.
>
> Proposed Enhancement
>
> I propose extending this change-based approach to manifest file tracking,
> specifically for fast-append operations. Here's how:
>
> 1. Repurpose the manifests field as a buffer to track new manifest file
> additions
> 2. Define the complete set of manifest files as the union of:
>    - Manifest files from the manifest-list
>    - Manifest files from the manifests field
>
> Implementation Details
>
> - When performing fast-append operations:
>   * New manifest files are added to the manifests field
>   * Changes are committed via update-table catalog operation
>   * The manifest-list remains unchanged, eliminating write amplification
>
> - After a configured number of fast-appends:
>   * Manifest files are removed from the manifests field
>   * Files are consolidated into a new manifest-list
>   * The manifest files are assigned the sequence-number of the snapshot
>     when they are written to the manifest-list
>
> Constraints and Considerations
>
> For this approach to work effectively, manifest files in the manifests
> field must:
>    * Contain only data files that are not referenced by other manifests
>    * Contain only delete files that reference data files already present
>      in the manifest-list
>
> If any of these assumptions is violated, the manifest files from the
> manifests field are flushed to the manifest-list and the standard commit
> procedure is applied.
>
> Benefits
>
> - Significantly reduced write amplification for streaming inserts
> - Simplifies conflict resolution by the catalog. If two concurrent writes
> occur, the entries in the manifests field can simply be merged together
> - Leverages existing Iceberg metadata constructs
> - Maintains compatibility with current catalog operations
>
> Note: While this proposal suggests repurposing the manifests field, we
> could alternatively implement this as a new metadata field if preferred.
>
> I'd appreciate your thoughts on this approach and welcome any feedback or
> concerns.
>
> Best regards,
>
> Jan
>
>
