Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

Péter Váry Fri, 22 Nov 2024 23:21:18 -0800

Currently we have a 'static' 2-level manifest structure. If we introduce
the 'everything is a manifest' concept, then we remove the limit on the
number of levels. This would prevent concurrent reading of the embedded
manifests (if the table has 5 levels of embedded manifests, the reader needs
to read those files sequentially). If the structure is not flattened
periodically, the result is a table that looks fine but is effectively
unreadable.
So, while 'everything is a manifest' is a good, flexible structure, it
has its own drawbacks.
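
To make the sequential-read concern concrete, here is a minimal Python
sketch. It is only an illustration under stated assumptions: manifests are
modeled as an in-memory dict rather than real Avro files, and the names
(plan_read_rounds, the "m0".."m4" paths) are hypothetical, not Iceberg APIs.
Manifests on the same level could be fetched concurrently, but each level
can only be started after the previous one has been read.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for manifest files: each manifest path maps
# to the entries it lists. An entry that is itself a key is a nested manifest;
# anything else is treated as a data file.
Manifests = Dict[str, List[str]]


def plan_read_rounds(manifests: Manifests, root: str) -> Tuple[int, List[str]]:
    """Walk the manifest tree level by level.

    Reads within one level are independent of each other, but a level can
    only be started once the previous level has been read, so the number of
    rounds is a lower bound on sequential round trips.
    """
    rounds = 0
    data_files: List[str] = []
    frontier = [root]
    while frontier:
        rounds += 1
        next_frontier: List[str] = []
        for path in frontier:
            for entry in manifests[path]:
                if entry in manifests:
                    next_frontier.append(entry)  # nested manifest, read next round
                else:
                    data_files.append(entry)     # leaf: a data file
        frontier = next_frontier
    return rounds, data_files


# Today's fixed layout: one manifest-list pointing at manifests.
two_level = {
    "manifest-list": ["m1", "m2"],
    "m1": ["a.parquet"],
    "m2": ["b.parquet"],
}

# Five levels of embedded manifests that were never flattened.
nested = {
    "m0": ["m1", "a.parquet"],
    "m1": ["m2", "b.parquet"],
    "m2": ["m3", "c.parquet"],
    "m3": ["m4", "d.parquet"],
    "m4": ["e.parquet"],
}

print(plan_read_rounds(two_level, "manifest-list")[0])  # 2 rounds
print(plan_read_rounds(nested, "m0")[0])                # 5 rounds
```

With the fixed layout the planner always needs two rounds, however many
manifests there are; with unbounded nesting the number of rounds grows with
the depth of the tree.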


On Fri, Nov 22, 2024, 18:56 Micah Kornfield <[email protected]> wrote:

> Would adding the ability to have a list of manifest lists solve this
> problem?  This might be an incremental step toward "everything is a
> manifest"?
>
> For now I wanted to reuse the existing manifest-list and manifests fields.
>
>
> Regardless of the outcome, please let's not re-use a field in a way that
> changes the semantics of the field; this goes against good practices on
> forward compatibility.
>
> Cheers,
> Micah
>
>
>
> On Fri, Nov 22, 2024 at 9:31 AM Jan Kaul <[email protected]>
> wrote:
>
>> Thanks for your feedback.
>>
>> About your concerns Fokko:
>>
>> 1. Generally the number of manifest files in the manifests field
>> shouldn't get too large. But I think you can already improve the write
>> amplification and conflict resolution by using up to 10 manifest files.
>> The fact that the manifests field only contains paths is not ideal and
>> may be a reason to have a separate discussion on a new metadata field.
>> However, the writer writing the manifest files could keep some kind of
>> cache of the partition values and statistics so that it doesn't need to
>> fetch the information when writing the manifest-list. This becomes an issue when
>> multiple concurrent writers are at work, because they would still need to
>> fetch the information from the files that they didn't write.
>> As you mentioned, my approach would be to always include the manifest
>> files from the manifests field in the query plan and only prune their
>> manifest_entries. I would try to keep the number of manifest files in the
>> manifests field small to reduce this effect, but this could definitely
>> be a drawback.
>>
>> 2.  Regarding the sequence-number inheritance, every manifest file in the
>> manifests field should inherit the sequence-number from the snapshot
>> that contains it. This means that all manifest files in the manifests
>> field have the same sequence-number, which limits the capabilities of
>> deletes. One could either limit deletes to only reference data files that
>> are already committed to the manifest-list, or flush the manifest files
>> from the manifests field every time a delete file occurs, essentially
>> disabling the proposed behavior. It would still yield
>> benefits for append-only tables.
>> The conflict resolution should be easier for most scenarios as the
>> manifest-list does not need to be rewritten. For appends the new
>> manifests field is the union of the manifest files of the conflicting
>> manifests fields.
>>
>> About your concerns Russell:
>>
>> My motivation was to have a separation between a consolidated and a
>> temporary list of manifest files. The contents of the temporary list
>> regularly get moved to the consolidated list. The fact that the
>> temporary list is small reduces the impact of frequent rewrites and makes
>> it easy to use set operations to resolve conflicts. These different lists
>> could be stored as two different manifest files that contain other
>> manifests or datafiles. For now I wanted to reuse the existing
>> manifest-list and manifests fields.
>>
>> Thanks,
>>
>> Jan
>> On 22.11.24 17:02, Russell Spitzer wrote:
>>
>> I would much rather we switch to the "everything is a manifest" approach.
>> Instead of manifest lists we only ever have manifests. A Manifest can then
>> link to data files or additional manifests. In the case of streaming then
>> you only ever have to read and write a single manifest. If we couple this
>> with delete vectors we can greatly reduce the number of writes. I am
>> generally against anything that puts additional (unbounded) content into
>> the metadata.json. I'm not sure if anyone has written this up as a full
>> proposal yet but I know it's been discussed a bunch.
>>
>> On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <[email protected]>
>> wrote:
>>
>>> Hi Jan,
>>>
>>> Thanks for sending out this proposal. While reading through it, two
>>> questions pop up:
>>>
>>>    - You mentioned repurposing the manifests field. Currently, this
>>>    field contains a list of paths that point to the manifest data.
>>>    Would this also be your suggestion? This way, when committing the
>>>    accumulated manifests into a manifest list, you would need to open up all
>>>    the manifests to get information like partition information, statistics,
>>>    etc. This way there is also no way to distinguish between data and delete
>>>    manifests without having to open the files, effectively always
>>>    including those files in the query plan.
>>>    - It is unclear to me if appending a manifest to the manifests will
>>>    create a new snapshot. I think it should. Either way, I think this
>>>    conflicts with the concept of sequence number inheritance
>>>    <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
>>>    This is used to avoid having to rewrite a manifest when a conflict
>>>    occurs; you only have to rewrite the manifest list. When there is a
>>>    conflict, the client that sees the conflict will take the latest
>>>    manifest-list and inherit the sequence number. When you can append to
>>>    the manifest list, you won't be able to determine which snapshot has
>>>    added the file. If you didn't use inheritance, you would need to
>>>    rewrite the manifest on a conflict (because the sequence ID has been
>>>    used already).
>>>
>>> I have to think a bit more about it but above are my concerns so far.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Fri, 22 Nov 2024 at 15:26, Jan Kaul
>>> <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to propose an optimization for how we track manifest files in
>>>> Iceberg tables, specifically focusing on reducing write amplification and
>>>> simplifying conflict resolution during fast-append operations.
>>>>
>>>> Background: Replace vs. Change-Based Updates
>>>>
>>>> To frame this proposal, let's first consider two approaches to state
>>>> management in table systems:
>>>>
>>>> 1. Replace-based updates: The entire state is replaced with each
>>>> update. This is how Iceberg currently handles manifest files - when new
>>>> manifests are added, we create an entirely new snapshot.
>>>>
>>>> 2. Change-based updates: Only incremental changes are tracked and
>>>> replayed to derive the current state. This is similar to how Delta tables
>>>> track data files.
>>>>
>>>> While Iceberg initially used purely replace-based updates, we've
>>>> already successfully adopted change-based updates for the top-level table
>>>> metadata with the REST catalog. Instead of uploading entire table metadata,
>>>> we now only upload new snapshots during update-table operations.
>>>>
>>>> Proposed Enhancement
>>>>
>>>> I propose extending this change-based approach to manifest file
>>>> tracking, specifically for fast-append operations. Here's how:
>>>>
>>>> 1. Repurpose the manifests field as a buffer to track new manifest
>>>> file additions
>>>> 2. Define the complete set of manifest files as the union of:
>>>>    - Manifest files from the manifest-list
>>>>    - Manifest files from the manifests field
>>>>
>>>> Implementation Details
>>>>
>>>> - When performing fast-append operations:
>>>>   * New manifest files are added to the manifests field
>>>>   * Changes are committed via update-table catalog operation
>>>>   * The manifest-list remains unchanged, eliminating write amplification
>>>>
>>>> - After a configured number of fast-appends:
>>>>   * Manifest files are removed from the manifests field
>>>>   * Files are consolidated into a new manifest-list
>>>>   * The manifest files are assigned the sequence-number of the snapshot
>>>> when they are written to the manifest-list
>>>>
>>>> Constraints and Considerations
>>>>
>>>> For this approach to work effectively, manifest files in the manifests
>>>> field must:
>>>>    * Contain only data files that are not referenced by other manifests
>>>>    * Contain only delete files that reference data files already
>>>> present in the manifest-list
>>>>
>>>> If any of these assumptions is violated, the manifest files from the
>>>> manifests field are flushed to the manifest-list and the standard
>>>> commit procedure is applied.
>>>>
>>>> Benefits
>>>>
>>>> - Significantly reduced write amplification for streaming inserts
>>>> - Simplifies conflict resolution by the catalog. If two concurrent
>>>> writes occur, the entries in the manifests field can simply be merged
>>>> together
>>>> - Leverages existing Iceberg metadata constructs
>>>> - Maintains compatibility with current catalog operations
>>>>
>>>> Note: While this proposal suggests repurposing the manifests field, we
>>>> could alternatively implement this as a new metadata field if preferred.
>>>>
>>>> I'd appreciate your thoughts on this approach and welcome any feedback
>>>> or concerns.
>>>>
>>>> Best regards,
>>>>
>>>> Jan
>>>>
>>>
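
For readers following the thread, the buffering scheme from the original
proposal can be sketched roughly as below. This is a hedged illustration,
not an implementation of anything in Iceberg: TableState, fast_append,
merge_appends and FLUSH_THRESHOLD are hypothetical names, manifests are
represented only by their paths, and sequence numbers and delete files are
left out entirely.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Assumption: flush once the buffer reaches 10 entries, echoing the
# "up to 10 manifest files" mentioned in the thread.
FLUSH_THRESHOLD = 10


@dataclass(frozen=True)
class TableState:
    manifest_list: FrozenSet[str] = frozenset()  # consolidated manifests
    manifests: FrozenSet[str] = frozenset()      # small buffer of recent appends

    def all_manifests(self) -> FrozenSet[str]:
        # The complete set is the union of both sources.
        return self.manifest_list | self.manifests


def fast_append(state: TableState, new_manifest: str) -> TableState:
    """Add a manifest to the buffer; consolidate once the buffer is full."""
    buffered = state.manifests | {new_manifest}
    if len(buffered) < FLUSH_THRESHOLD:
        # The manifest-list is untouched, so nothing large is rewritten.
        return TableState(state.manifest_list, buffered)
    # Flush: move the buffered manifests into a new manifest-list.
    return TableState(state.manifest_list | buffered, frozenset())


def merge_appends(base: TableState, ours: TableState, theirs: TableState) -> TableState:
    """Resolve two concurrent fast-appends that both started from `base`."""
    if ours.manifest_list != base.manifest_list or theirs.manifest_list != base.manifest_list:
        # A flush happened on one side; fall back to the standard commit path.
        raise ValueError("manifest-list changed; cannot merge buffers")
    return TableState(base.manifest_list, ours.manifests | theirs.manifests)


base = TableState(manifest_list=frozenset({"m0"}))
ours = fast_append(base, "m1")
theirs = fast_append(base, "m2")
merged = merge_appends(base, ours, theirs)
print(sorted(merged.manifests))        # ['m1', 'm2']
print(sorted(merged.all_manifests()))  # ['m0', 'm1', 'm2']
```

The sketch only covers the append-only case. The concerns raised in the
replies (sequence-number inheritance, distinguishing data from delete
manifests, pruning without opening the buffered files) are not addressed
by it.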
