Currently we have a 'static' two-level manifest structure. If we introduce the 'everything is a manifest' concept, we remove the limit on the number of levels. That would prevent concurrent reading of the embedded manifests: if the table has 5 levels of embedded manifests, the reader has to read those files sequentially, one level after another. Unless the structure is flattened periodically, the result would be a table that looks fine but is effectively unreadable. So, while 'everything is a manifest' is a good, flexible structure, it has its own drawbacks.
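
To make the concern about nesting depth concrete, here is a minimal sketch. The `Manifest` class and the `plan_scan` and `nested` helpers are hypothetical and are not Iceberg APIs. The sketch assumes that all manifests on one level can be fetched in parallel, but that a child manifest's path is only known once its parent has been read, so each level costs one sequential round of reads.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical node: a manifest that links to data files and/or child manifests."""
    path: str
    data_files: list[str] = field(default_factory=list)
    children: list[Manifest] = field(default_factory=list)


def plan_scan(root: Manifest) -> tuple[list[str], int]:
    """Return (data files, number of sequential read rounds).

    Assumption: reads within one level may run concurrently, but the next
    level cannot start until the current one has been read.
    """
    data_files: list[str] = []
    rounds = 0
    level = [root]
    while level:
        rounds += 1
        next_level: list[Manifest] = []
        for manifest in level:  # these reads could be issued in parallel
            data_files.extend(manifest.data_files)
            next_level.extend(manifest.children)
        level = next_level
    return data_files, rounds


def nested(depth: int) -> Manifest:
    """Build a chain of `depth` manifests, each embedding the next one."""
    node = Manifest(f"m{depth}.avro", data_files=[f"d{depth}.parquet"])
    for i in range(depth - 1, 0, -1):
        node = Manifest(f"m{i}.avro", data_files=[f"d{i}.parquet"], children=[node])
    return node


files, rounds = plan_scan(nested(5))
print(rounds)  # 5
```

With five levels of embedded manifests the reader needs five dependent rounds, regardless of how much parallelism is available within a round. Flattening the structure brings this back down to a small, fixed number.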

On Fri, Nov 22, 2024, 18:56 Micah Kornfield <emkornfi...@gmail.com> wrote:

> Would adding the ability to have a list of manifest lists solve this
> problem? This might be an incremental step to getting to "everything is a
> manifest"?
>
>> For now I wanted to reuse the existing manifest-list and manifests fields.
>
> Regardless of the outcome, please let's not re-use a field in a way that
> will change the semantics of the field; this goes against good practices on
> forward compatibility.
>
> Cheers,
> Micah
>
> On Fri, Nov 22, 2024 at 9:31 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
>
>> Thanks for your feedback.
>>
>> About your concerns Fokko:
>>
>> 1. Generally the number of manifest files in the manifests field
>> shouldn't get too large. But I think you can already improve the write
>> amplification and conflict resolution by using up to 10 manifest files.
>> The fact that the manifests field only contains paths is not ideal and
>> may be a reason to have a separate discussion on a new metadata field.
>> However, the writer writing the manifest files could keep some kind of
>> cache of the partition values and statistics so that it doesn't need to
>> fetch the information when writing the manifest-list. This becomes an
>> issue when multiple concurrent writers are at work, because they would
>> still need to fetch the information from the files that they didn't write.
>> As you mentioned, my approach would be to always include the manifest
>> files from the manifests field in the query plan and only prune their
>> manifest_entries. I would try to keep the number of manifest files in the
>> manifests field small to reduce this effect, but this could definitely
>> be a drawback.
>>
>> 2. Regarding the sequence-number inheritance, every manifest file in the
>> manifests field should inherit the sequence-number from the snapshot
>> that contains it. This means that all manifest files in the manifests
>> field have the same sequence-number, which limits the capabilities of
>> deletes. One could either limit deletes to only reference data files that
>> are already committed to the manifest-list, or one might flush the
>> manifest files from the manifests field every time a delete file
>> occurs, essentially disabling the proposed behavior. It would still yield
>> benefits for append-only tables.
>> The conflict resolution should be easier for most scenarios as the
>> manifest-list does not need to be rewritten. For appends the new
>> manifests field is the union of the manifest files of the conflicting
>> manifests fields.
>>
>> About your concerns Russell:
>>
>> My motivation was to have a separation between a consolidated and a
>> temporary list of manifest files. The contents of the temporary list
>> regularly get moved to the consolidated list. But the fact that the
>> temporary list is small reduces the impact of frequent rewrites and makes
>> it easy to use set operations to resolve conflicts. These different lists
>> could be stored as two different manifest files that contain other
>> manifests or datafiles. For now I wanted to reuse the existing
>> manifest-list and manifests fields.
>>
>> Thanks,
>>
>> Jan
>>
>> On 22.11.24 17:02, Russell Spitzer wrote:
>>
>> I would much rather we switch to the "everything is a manifest" approach.
>> Instead of manifest lists we only ever have manifests. A manifest can then
>> link to data files or additional manifests. In the case of streaming then
>> you only ever have to read and write a single manifest. If we couple this
>> with delete vectors we can greatly reduce the number of writes. I am
>> generally against anything that puts additional (unbounded) content into
>> the metadata.json. I'm not sure if anyone has written this up as a full
>> proposal yet but I know it's been discussed a bunch.
>>
>> On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hi Jan,
>>>
>>> Thanks for sending out this proposal. While reading through it, two
>>> questions pop up:
>>>
>>> - You mentioned repurposing the manifests field. Currently, this
>>>   field contains a list of paths that point to the manifest data.
>>>   Would this also be your suggestion? This way, when committing the
>>>   accumulated manifests into a manifest list, you would need to open up
>>>   all the manifests to get information like partition information,
>>>   statistics, etc. This way there is also no way to distinguish between
>>>   data and delete manifests without having to open the files, effectively
>>>   always including those files in the query plan.
>>> - It is unclear to me if appending a manifest to the manifests will
>>>   create a new snapshot. I think it should. Either way, I think this
>>>   conflicts with the concept of sequence number inheritance
>>>   <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
>>>   This is used to avoid having to rewrite a manifest when a conflict
>>>   occurs; you only have to rewrite the manifest list. When there is a
>>>   conflict, the client that sees the conflict will take the latest
>>>   manifest-list and inherit the sequence number. When you can append to
>>>   the manifest list, you won't be able to determine which snapshot has
>>>   added the file. If you wouldn't use inheritance, then you would need to
>>>   rewrite the manifest on a conflict (because the sequence ID has been
>>>   used already).
>>>
>>> I have to think a bit more about it but above are my concerns so far.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Fri, 22 Nov 2024 at 15:26, Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to propose an optimization for how we track manifest files in
>>>> Iceberg tables, specifically focusing on reducing write amplification and
>>>> simplifying conflict resolution during fast-append operations.
>>>>
>>>> Background: Replace vs. Change-Based Updates
>>>>
>>>> To frame this proposal, let's first consider two approaches to state
>>>> management in table systems:
>>>>
>>>> 1. Replace-based updates: The entire state is replaced with each
>>>>    update. This is how Iceberg currently handles manifest files: when new
>>>>    manifests are added, we create an entirely new snapshot.
>>>>
>>>> 2. Change-based updates: Only incremental changes are tracked and
>>>>    replayed to derive the current state. This is similar to how Delta
>>>>    tables track data files.
>>>>
>>>> While Iceberg initially used purely replace-based updates, we've
>>>> already successfully adopted change-based updates for the top-level table
>>>> metadata with the REST catalog. Instead of uploading entire table
>>>> metadata, we now only upload new snapshots during update-table operations.
>>>>
>>>> Proposed Enhancement
>>>>
>>>> I propose extending this change-based approach to manifest file
>>>> tracking, specifically for fast-append operations. Here's how:
>>>>
>>>> 1. Repurpose the manifests field as a buffer to track new manifest
>>>>    file additions
>>>> 2. Define the complete set of manifest files as the union of:
>>>>    - Manifest files from the manifest-list
>>>>    - Manifest files from the manifests field
>>>>
>>>> Implementation Details
>>>>
>>>> - When performing fast-append operations:
>>>>   * New manifest files are added to the manifests field
>>>>   * Changes are committed via the update-table catalog operation
>>>>   * The manifest-list remains unchanged, eliminating write amplification
>>>>
>>>> - After a configured number of fast-appends:
>>>>   * Manifest files are removed from the manifests field
>>>>   * Files are consolidated into a new manifest-list
>>>>   * The manifest files are assigned the sequence-number of the snapshot
>>>>     when they are written to the manifest-list
>>>>
>>>> Constraints and Considerations
>>>>
>>>> For this approach to work effectively, manifest files in the manifests
>>>> field must:
>>>>   * Contain only data files that are not referenced by other manifests
>>>>   * Contain only delete files that reference data files already
>>>>     present in the manifest-list
>>>>
>>>> If any of these assumptions is violated, the manifest files from the
>>>> manifests field are flushed to the manifest-list and the standard
>>>> commit procedure is applied.
>>>>
>>>> Benefits
>>>>
>>>> - Significantly reduced write amplification for streaming inserts
>>>> - Simplified conflict resolution by the catalog: if two concurrent
>>>>   writes occur, the entries in the manifests field can simply be merged
>>>>   together
>>>> - Leverages existing Iceberg metadata constructs
>>>> - Maintains compatibility with current catalog operations
>>>>
>>>> Note: While this proposal suggests repurposing the manifests field, we
>>>> could alternatively implement this as a new metadata field if preferred.
>>>>
>>>> I'd appreciate your thoughts on this approach and welcome any feedback
>>>> or concerns.
>>>>
>>>> Best regards,
>>>>
>>>> Jan
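
The buffering scheme in the original proposal can be sketched roughly as follows. This is only an illustration under stated assumptions: `TableState`, `fast_append`, `merge_conflict`, and the `max_buffered` threshold are hypothetical names, not Iceberg APIs, and the sketch leaves out sequence-number assignment and the delete-file constraints entirely.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableState:
    """Hypothetical, simplified view of the two places manifests are tracked."""
    manifest_list: frozenset[str]  # consolidated manifests
    manifests: frozenset[str]      # buffer of recently appended manifests

    def all_manifests(self) -> frozenset[str]:
        # The complete set is the union of both sources.
        return self.manifest_list | self.manifests


def fast_append(state: TableState, new_manifest: str, max_buffered: int = 10) -> TableState:
    """Add a manifest to the buffer; flush into the manifest-list once it is too large."""
    buffered = state.manifests | {new_manifest}
    if len(buffered) > max_buffered:
        # Assumed flush policy: consolidate everything and empty the buffer.
        return TableState(state.manifest_list | buffered, frozenset())
    return TableState(state.manifest_list, buffered)


def merge_conflict(ours: TableState, theirs: TableState) -> TableState:
    """Resolve two concurrent fast-appends by taking the union of their buffers."""
    if ours.manifest_list != theirs.manifest_list:
        raise ValueError("manifest-list changed; fall back to the standard commit path")
    return TableState(ours.manifest_list, ours.manifests | theirs.manifests)


base = TableState(frozenset({"m0.avro"}), frozenset())
writer_a = fast_append(base, "m1.avro")
writer_b = fast_append(base, "m2.avro")
merged = merge_conflict(writer_a, writer_b)
print(sorted(merged.all_manifests()))  # ['m0.avro', 'm1.avro', 'm2.avro']
```

Because the buffer is a set, merging concurrent appends is a plain union and the manifest-list is untouched until a flush. The default of 10 only mirrors the "up to 10 manifest files" figure mentioned earlier in the thread; it is not a recommended value.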