Would adding the ability to have a list of manifest lists solve this problem? This might be an incremental step toward getting to "everything is a manifest"?
> For now I wanted to reuse the existing manifest-list and manifests fields.

Regardless of the outcome, please let's not re-use a field in a way that
will change the semantics of the field; this goes against good practices
on forward compatibility.

Cheers,
Micah

On Fri, Nov 22, 2024 at 9:31 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:

> Thanks for your feedback.
>
> About your concerns Fokko:
>
> 1. Generally the number of manifest files in the manifests field
> shouldn't get too large. But I think you can already improve the write
> amplification and conflict resolution by using up to 10 manifest files.
> The fact that the manifests field only contains paths is not ideal and
> may be a reason to have a separate discussion on a new metadata field.
> However, the writer writing the manifest files could keep some kind of
> cache of the partition values and statistics so that it doesn't need to
> fetch the information when writing the manifest-list. This becomes an
> issue when multiple concurrent writers are at work, because they would
> still need to fetch the information from the files that they didn't write.
> As you mentioned, my approach would be to always include the manifest
> files from the manifests field in the query plan and only prune their
> manifest_entries. I would try to keep the number of manifest files in the
> manifests field small to reduce this effect, but this could definitely be
> a drawback.
>
> 2. Regarding the sequence-number inheritance, every manifest file in the
> manifests field should inherit the sequence-number from the snapshot that
> contains it. This means that all manifest files in the manifests field
> have the same sequence-number, which limits the capabilities of deletes.
> One could either limit deletes to only reference data files that are
> already committed to the manifest-list or one might flush the manifest
> files from the manifests field every time a delete file occurs.
> That would essentially disable the proposed behavior. It would still
> yield benefits for append-only tables.
> The conflict resolution should be easier for most scenarios as the
> manifest-list does not need to be rewritten. For appends the new manifests
> field is the union of the manifest files of the conflicting manifests
> fields.
>
> About your concerns Russell:
>
> My motivation was to have a separation between a consolidated and a
> temporary list of manifest files. The contents of the temporary list
> regularly get moved to the consolidated list. But the fact that the
> temporary list is small reduces the impact of frequent rewrites and makes
> it easy to use set operations to resolve conflicts. These different lists
> could be stored as two different manifest files that contain other
> manifests or datafiles. For now I wanted to reuse the existing
> manifest-list and manifests fields.
>
> Thanks,
>
> Jan
>
> On 22.11.24 17:02, Russell Spitzer wrote:
>
> I would much rather we switch to the "everything is a manifest" approach.
> Instead of manifest lists we only ever have manifests. A manifest can then
> link to data files or additional manifests. In the case of streaming then
> you only ever have to read and write a single manifest. If we couple this
> with delete vectors we can greatly reduce the number of writes. I am
> generally against anything that puts additional (unbounded) content into
> the metadata.json. I'm not sure if anyone has written this up as a full
> proposal yet but I know it's been discussed a bunch.
>
> On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org> wrote:
>
>> Hi Jan,
>>
>> Thanks for sending out this proposal. While reading through it, two
>> questions pop up:
>>
>>    - You mentioned repurposing the manifests field. Currently, this
>>    field contains a list of paths that point to the manifest data. Would
>>    this also be your suggestion?
>>    This way, when committing the accumulated manifests into a manifest
>>    list, you would need to open up all the manifests to get information
>>    like partition information, statistics, etc. This way there is also
>>    no way to distinguish between data and delete manifests without
>>    having to open the files, effectively always including those files
>>    in the query plan.
>>    - It is unclear to me if appending a manifest to the manifests will
>>    create a new snapshot. I think it should. Either way, I think this
>>    conflicts with the concept of sequence number inheritance
>>    <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
>>    This is used to avoid having to rewrite a manifest when a conflict
>>    occurs; you only have to rewrite the manifest list. When there is a
>>    conflict, the client that sees the conflict will take the latest
>>    manifest-list and inherit the sequence number. When you can append
>>    to the manifest list, you won't be able to determine which snapshot
>>    has added the file. If you didn't use inheritance, then you would
>>    need to rewrite the manifest on a conflict (because the sequence ID
>>    has been used already).
>>
>> I have to think a bit more about it but above are my concerns so far.
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Nov 22, 2024 at 15:26, Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to propose an optimization for how we track manifest files in
>>> Iceberg tables, specifically focusing on reducing write amplification and
>>> simplifying conflict resolution during fast-append operations.
>>>
>>> Background: Replace vs. Change-Based Updates
>>>
>>> To frame this proposal, let's first consider two approaches to state
>>> management in table systems:
>>>
>>> 1. Replace-based updates: The entire state is replaced with each update.
>>> This is how Iceberg currently handles manifest files: when new manifests
>>> are added, we create an entirely new snapshot.
>>>
>>> 2. Change-based updates: Only incremental changes are tracked and
>>> replayed to derive the current state. This is similar to how Delta tables
>>> track data files.
>>>
>>> While Iceberg initially used purely replace-based updates, we've already
>>> successfully adopted change-based updates for the top-level table metadata
>>> with the REST catalog. Instead of uploading entire table metadata, we now
>>> only upload new snapshots during update-table operations.
>>>
>>> Proposed Enhancement
>>>
>>> I propose extending this change-based approach to manifest file
>>> tracking, specifically for fast-append operations. Here's how:
>>>
>>> 1. Repurpose the manifests field as a buffer to track new manifest file
>>> additions
>>> 2. Define the complete set of manifest files as the union of:
>>>    - Manifest files from the manifest-list
>>>    - Manifest files from the manifests field
>>>
>>> Implementation Details
>>>
>>> - When performing fast-append operations:
>>>   * New manifest files are added to the manifests field
>>>   * Changes are committed via update-table catalog operation
>>>   * The manifest-list remains unchanged, eliminating write amplification
>>>
>>> - After a configured number of fast-appends:
>>>   * Manifest files are removed from the manifests field
>>>   * Files are consolidated into a new manifest-list
>>>   * The manifest files are assigned the sequence-number of the snapshot
>>>     when they are written to the manifest-list
>>>
>>> Constraints and Considerations
>>>
>>> For this approach to work effectively, manifest files in the manifests
>>> field must:
>>>   * Contain only data files that are not referenced by other manifests
>>>   * Contain only delete files that reference data files already present
>>>     in the manifest-list
>>>
>>> If any of these assumptions is violated, the manifest files from the
>>> manifests field are flushed to
>>> the manifest-list and the standard commit procedure is applied.
>>>
>>> Benefits
>>>
>>> - Significantly reduced write amplification for streaming inserts
>>> - Simplifies conflict resolution by the catalog. If two concurrent
>>>   writes occur, the entries in the manifests field can simply be merged
>>>   together
>>> - Leverages existing Iceberg metadata constructs
>>> - Maintains compatibility with current catalog operations
>>>
>>> Note: While this proposal suggests repurposing the manifests field, we
>>> could alternatively implement this as a new metadata field if preferred.
>>>
>>> I'd appreciate your thoughts on this approach and welcome any feedback
>>> or concerns.
>>>
>>> Best regards,
>>>
>>> Jan
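
For readers following the thread, the buffering scheme in the original proposal (complete manifest set as a union, flush after a configured number of fast-appends, conflicts resolved by merging the buffers) can be sketched roughly as below. This is only an illustration under stated assumptions: `TableState`, `all_manifests`, `fast_append`, `merge_concurrent` and the default `flush_threshold` of 10 are hypothetical names and values, not part of the Iceberg spec or any Iceberg library, and sequence numbers and delete files are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableState:
    """Hypothetical stand-in for the two metadata fields under discussion."""
    manifest_list: List[str] = field(default_factory=list)  # consolidated manifests
    manifests: List[str] = field(default_factory=list)      # buffer of recent fast-appends


def all_manifests(state: TableState) -> List[str]:
    """Complete set: manifest-list entries followed by the buffered entries."""
    return state.manifest_list + state.manifests


def fast_append(state: TableState, manifest_path: str,
                flush_threshold: int = 10) -> TableState:
    """Add a manifest to the buffer; consolidate once the buffer is full."""
    buffered = state.manifests + [manifest_path]
    if len(buffered) >= flush_threshold:
        # Flush: the buffered manifests move into a new manifest-list.
        return TableState(state.manifest_list + buffered, [])
    # Otherwise the manifest-list is left untouched.
    return TableState(list(state.manifest_list), buffered)


def merge_concurrent(base: TableState, ours: TableState,
                     theirs: TableState) -> TableState:
    """Merge two fast-appends made on the same base by unioning the buffers.

    Assumes neither writer triggered a flush; otherwise the standard
    commit procedure would have to be used instead.
    """
    merged = list(theirs.manifests)
    for path in ours.manifests:
        if path not in merged:
            merged.append(path)
    return TableState(list(base.manifest_list), merged)
```

In this sketch two writers that both start from the same state and each append one manifest end up with a buffer holding both new entries, while the manifest-list itself is never rewritten until the threshold is reached.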