I would much rather we switch to the "everything is a manifest" approach. Instead of manifest lists, we would only ever have manifests. A manifest can then link to data files or to additional manifests. In the streaming case, you then only ever have to read and write a single manifest. If we couple this with delete vectors, we can greatly reduce the number of writes. I am generally against anything that puts additional (unbounded) content into the metadata.json. I'm not sure if anyone has written this up as a full proposal yet, but I know it's been discussed a bunch.
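To make the idea a bit more concrete, here is a rough Python sketch of what "a manifest links to data files or additional manifests" could look like. The `DataFile` and `Manifest` classes, the `iter_data_files` helper, and the file names are all made up for illustration; this is not how the Iceberg spec or any existing library models it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Union


@dataclass(frozen=True)
class DataFile:
    """Hypothetical leaf entry: a single data file."""
    path: str


@dataclass
class Manifest:
    """Hypothetical manifest whose entries are data files or child manifests."""
    path: str
    entries: list[Union[DataFile, "Manifest"]] = field(default_factory=list)


def iter_data_files(manifest: Manifest) -> Iterator[DataFile]:
    """Walk the manifest tree depth-first and yield every data file."""
    for entry in manifest.entries:
        if isinstance(entry, Manifest):
            yield from iter_data_files(entry)
        else:
            yield entry


# An older manifest that is already written and never touched again.
older = Manifest("m-older.avro", [DataFile("a.parquet"), DataFile("b.parquet")])

# The root manifest links to the older manifest instead of a manifest list.
root = Manifest("m-root.avro", [older])

# A streaming append only adds an entry to the root manifest.
root.entries.append(DataFile("c.parquet"))

paths = [f.path for f in iter_data_files(root)]
```

In this toy model the streaming writer reads and rewrites only the root manifest, while the older manifest stays as it is.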
On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hi Jan,
>
> Thanks for sending out this proposal. While reading through it, two
> questions pop up:
>
>    - You mentioned repurposing the manifests field. Currently, this field
>    contains a list of paths that point to the manifest data. Would this
>    also be your suggestion? This way, when committing the accumulated
>    manifests into a manifest list, you would need to open up all the manifests
>    to get information like partition information, statistics, etc. This way
>    there is also no way to distinguish between data and delete manifests
>    without having to open the files, effectively always including those files
>    in the query plan.
>    - It is unclear to me if appending a manifest to the manifests will
>    create a new snapshot. I think it should. Either way, I think this
>    conflicts with the concept of sequence number inheritance
>    <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
>    This is used to avoid having to rewrite a manifest when a conflict occurs,
>    you only have to rewrite the manifest list. When there is a conflict, the
>    client that sees the conflict, will take the latest manifest-list, and
>    inherit in the sequence number. When you can append to the manifest list,
>    you won't be able to determine which snapshot has added the file. If you
>    wouldn't use inheritance, then you would need to rewrite the manifest on a
>    conflict (because the sequence ID has been used already).
>
> I have to think a bit more about it but above are my concerns so far.
>
> Kind regards,
> Fokko
>
> On Fri, Nov 22, 2024 at 15:26, Jan Kaul <jank...@mailbox.org.invalid> wrote:
>
>> Hi all,
>>
>> I'd like to propose an optimization for how we track manifest files in
>> Iceberg tables, specifically focusing on reducing write amplification and
>> simplifying conflict resolution during fast-append operations.
>>
>> Background: Replace vs. Change-Based Updates
>>
>> To frame this proposal, let's first consider two approaches to state
>> management in table systems:
>>
>> 1. Replace-based updates: The entire state is replaced with each update.
>> This is how Iceberg currently handles manifest files - when new manifests
>> are added, we create an entirely new snapshot.
>>
>> 2. Change-based updates: Only incremental changes are tracked and
>> replayed to derive the current state. This is similar to how Delta tables
>> track data files.
>>
>> While Iceberg initially used purely replace-based updates, we've already
>> successfully adopted change-based updates for the top-level table metadata
>> with the REST catalog. Instead of uploading entire table metadata, we now
>> only upload new snapshots during update-table operations.
>>
>> Proposed Enhancement
>>
>> I propose extending this change-based approach to manifest file tracking,
>> specifically for fast-append operations. Here's how:
>>
>> 1. Repurpose the manifests field as a buffer to track new manifest file
>> additions
>> 2. Define the complete set of manifest files as the union of:
>>    - Manifest files from the manifest-list
>>    - Manifest files from the manifests field
>>
>> Implementation Details
>>
>> - When performing fast-append operations:
>>   * New manifest files are added to the manifests field
>>   * Changes are committed via update-table catalog operation
>>   * The manifest-list remains unchanged, eliminating write amplification
>>
>> - After a configured number of fast-appends:
>>   * Manifest files are removed from the manifests field
>>   * Files are consolidated into a new manifest-list
>>   * The manifest files are assigned the sequence-number of the snapshot
>>     when they are written to the manifest-list
>>
>> Constraints and Considerations
>>
>> For this approach to work effectively, manifest files in the manifests
>> field must:
>>   * Contain only data files that are not referenced by other manifests
>>   * Contain only delete files that reference data files already present
>>     in the manifest-list
>>
>> If any of these assumptions is violated, the manifest files from the
>> manifests field are flushed to the manifest-list and the standard commit
>> procedure is applied.
>>
>> Benefits
>>
>> - Significantly reduced write amplification for streaming inserts
>> - Simplifies conflict resolution by the catalog. If two concurrent writes
>> occur, the entries in the manifests field can simply be merged together
>> - Leverages existing Iceberg metadata constructs
>> - Maintains compatibility with current catalog operations
>>
>> Note: While this proposal suggests repurposing the manifests field, we
>> could alternatively implement this as a new metadata field if preferred.
>>
>> I'd appreciate your thoughts on this approach and welcome any feedback or
>> concerns.
>>
>> Best regards,
>>
>> Jan
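
For reference while discussing, the buffering scheme from the quoted proposal can be sketched roughly as follows. This is only a toy model under stated assumptions: `TableState`, `flush_threshold`, and `merge_buffers` are hypothetical names, manifests are plain path strings, and sequence-number assignment and the constraint checks are not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    """Hypothetical in-memory stand-in for the relevant table metadata."""
    manifest_list: list[str] = field(default_factory=list)  # consolidated manifests
    manifests: list[str] = field(default_factory=list)      # buffer of fast-appended manifests
    flush_threshold: int = 3                                # assumed "configured number of fast-appends"

    def all_manifests(self) -> list[str]:
        """Complete set: union of the manifest-list and the manifests buffer."""
        return self.manifest_list + self.manifests

    def fast_append(self, manifest_path: str) -> None:
        """Add a manifest to the buffer; the manifest-list is left untouched."""
        self.manifests.append(manifest_path)
        if len(self.manifests) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Consolidate buffered manifests into a new manifest-list."""
        self.manifest_list = self.manifest_list + self.manifests
        self.manifests = []


def merge_buffers(a: list[str], b: list[str]) -> list[str]:
    """Merge two concurrent buffers as an order-preserving union."""
    return list(dict.fromkeys(a + b))


state = TableState(manifest_list=["m0.avro"])
state.fast_append("m1.avro")
state.fast_append("m2.avro")
before_flush = (list(state.manifest_list), list(state.manifests))

state.fast_append("m3.avro")  # third append reaches the threshold and flushes
after_flush = (list(state.manifest_list), list(state.manifests))

merged = merge_buffers(["m1.avro", "m2.avro"], ["m1.avro", "m3.avro"])
```

Note that in this sketch the buffer lives in the table metadata itself, so it grows with every fast-append until the next flush.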