Thanks for your feedback.
About your concerns, Fokko:
1. Generally the number of manifest files in the manifests field
shouldn't get too large. But I think you can already improve the write
amplification and conflict resolution by using up to 10 manifest
files. The fact that the manifests field only contains paths is not
ideal and may be a reason to have a separate discussion on a new
metadata field.
However, the writer writing the manifest files could keep some kind of
cache of the partition values and statistics so that it doesn't need to
fetch the information when writing the manifest-list. This becomes an
issue when multiple concurrent writers are at work, because they would
still need to fetch the information from the files that they didn't write.
As you mentioned, my approach would be to always include the manifest
files from the manifests field in the query plan and only prune their
manifest_entries. I would try to keep the number of manifest files in
the manifests field small to reduce this effect, but this could
definitely be a drawback.
2. Regarding the sequence-number inheritance, every manifest file in
the manifests field should inherit the sequence-number from the snapshot
that contains it. This means that all manifest files in the manifests
field have the same sequence-number, which limits the capabilities of
deletes. One could either limit deletes to only reference data files
that are already committed to the manifest-list, or one might flush
the manifest files from the manifests field every time a delete file
occurs, essentially disabling the proposed behavior. It would still
yield benefits for append-only tables.
The conflict resolution should be easier for most scenarios as the
manifest-list does not need to be rewritten. For appends the new
manifests field is the union of the manifest files of the conflicting
manifests fields.
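As a rough illustration of the append case, the union could be computed
like this. It is only a sketch: merge_manifests_fields and the file names
are made up for the example, and it assumes the manifests field is a plain
list of manifest paths.

```python
def merge_manifests_fields(theirs, ours):
    """Ordered union of two manifests fields (lists of manifest paths).

    Hypothetical helper, not an Iceberg API. Paths already committed by the
    other writer come first; our new paths are appended without duplicates.
    """
    merged = []
    seen = set()
    for path in list(theirs) + list(ours):
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged


# Two writers start from the same manifests field and each add one manifest.
base = ["m1.avro"]
writer_a = base + ["m2.avro"]  # committed first
writer_b = base + ["m3.avro"]  # hits the conflict and merges

merged = merge_manifests_fields(writer_a, writer_b)
print(merged)
```

Neither writer has to rewrite the manifest-list here; only the small
manifests field changes.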
About your concerns, Russell:
My motivation was to have a separation between a consolidated and a
temporary list of manifest files. The contents of the temporary list
regularly get moved to the consolidated list. But the fact that the
temporary list is small reduces the impact of frequent rewrites and
makes it easy to use set operations to resolve conflicts. These
different lists could be stored as two different manifest files that
contain other manifests or datafiles. For now I wanted to reuse the
existing manifest-list and manifests fields.
Thanks,
Jan
On 22.11.24 17:02, Russell Spitzer wrote:
I would much rather we switch to the "everything is a manifest"
approach. Instead of manifest lists we only ever have manifests. A
Manifest can then link to data files or additional manifests. In the
case of streaming then you only ever have to read and write a single
manifest. If we couple this with delete vectors we can greatly reduce
the number of writes. I am generally against anything that puts
additional (unbounded) content into the metadata.json. I'm not sure if
anyone has written this up as a full proposal yet but I know it's been
discussed a bunch.
On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <fo...@apache.org> wrote:
Hi Jan,
Thanks for sending out this proposal. While reading through it,
two questions pop up:
* You mentioned repurposing the manifests field. Currently, this
field contains a list of paths that point to the manifest
data. Would this also be your suggestion? This way, when
committing the accumulated manifests into a manifest list, you
would need to open up all the manifests to get information
like partition information, statistics, etc. This way there is
also no way to distinguish between data and delete manifests
without having to open the files, effectively always
including those files in the query plan.
* It is unclear to me if appending a manifest to the manifests
field will create a new snapshot. I think it should. Either way, I
think this conflicts with the concept of sequence number
inheritance
<https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
This is used to avoid having to rewrite a manifest when a
conflict occurs, you only have to rewrite the manifest list.
When there is a conflict, the client that sees the conflict
will take the latest manifest-list and inherit the
sequence number. When you can append to the manifest list, you
won't be able to determine which snapshot has added the file.
If you wouldn't use inheritance, then you would need to
rewrite the manifest on a conflict (because the sequence ID
has been used already).
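To make the inheritance point concrete, here is a minimal sketch. It
assumes that manifest entries written without an explicit sequence number
take the one recorded for their manifest in the manifest-list. The function
names and the sample entries are hypothetical and are not part of any
Iceberg library.

```python
def effective_sequence_number(entry_seq, manifest_seq):
    """An entry without its own sequence number inherits the manifest's."""
    return manifest_seq if entry_seq is None else entry_seq


def resolve(entries, manifest_seq):
    """Resolve the sequence number of every entry in one manifest."""
    return [
        effective_sequence_number(e["sequence_number"], manifest_seq)
        for e in entries
    ]


# Entries as written by the client: no sequence number stored in the file.
manifest_entries = [
    {"file": "d1.parquet", "sequence_number": None},
    {"file": "d2.parquet", "sequence_number": None},
]

# First commit attempt planned sequence number 5 but lost the race.
first_attempt = resolve(manifest_entries, 5)
# The retry only writes a new manifest-list entry with sequence number 6;
# the manifest file itself is unchanged.
after_retry = resolve(manifest_entries, 6)
print(first_attempt, after_retry)
```

If the sequence number were stored in each entry instead, the retry would
have to rewrite the manifest file.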
I have to think a bit more about it but above are my concerns so far.
Kind regards,
Fokko
Op vr 22 nov 2024 om 15:26 schreef Jan Kaul
<jank...@mailbox.org.invalid>:
Hi all,
I'd like to propose an optimization for how we track manifest
files in Iceberg tables, specifically focusing on reducing
write amplification and simplifying conflict resolution during
fast-append operations.
Background: Replace vs. Change-Based Updates
To frame this proposal, let's first consider two approaches to
state management in table systems:
1. Replace-based updates: The entire state is replaced with
each update. This is how Iceberg currently handles manifest
files - when new manifests are added, we create an entirely
new snapshot.
2. Change-based updates: Only incremental changes are tracked
and replayed to derive the current state. This is similar to
how Delta tables track data files.
While Iceberg initially used purely replace-based updates,
we've already successfully adopted change-based updates for
the top-level table metadata with the REST catalog. Instead of
uploading entire table metadata, we now only upload new
snapshots during update-table operations.
Proposed Enhancement
I propose extending this change-based approach to manifest
file tracking, specifically for fast-append operations. Here's
how:
1. Repurpose the manifests field as a buffer to track new
manifest file additions
2. Define the complete set of manifest files as the union of:
- Manifest files from the manifest-list
- Manifest files from the manifests field
Implementation Details
- When performing fast-append operations:
* New manifest files are added to the manifests field
* Changes are committed via update-table catalog operation
* The manifest-list remains unchanged, eliminating write
amplification
- After a configured number of fast-appends:
* Manifest files are removed from the manifests field
* Files are consolidated into a new manifest-list
* The manifest files are assigned the sequence-number of the
snapshot when they are written to the manifest-list
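The steps above can be sketched as a small in-memory model. This is an
illustration under stated assumptions, not an implementation: TableState,
max_buffered and the way the sequence number is bumped on flush are all
hypothetical, and a real commit would go through the catalog's update-table
operation.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    # Consolidated manifests as (path, sequence_number) pairs.
    manifest_list: list = field(default_factory=list)
    # Buffered manifest paths (the repurposed manifests field).
    manifests: list = field(default_factory=list)
    sequence_number: int = 0
    # Assumed "configured number of fast-appends" before consolidation.
    max_buffered: int = 3

    def all_manifests(self):
        """Complete set: manifest-list entries plus the buffered manifests."""
        return [path for path, _ in self.manifest_list] + list(self.manifests)

    def fast_append(self, manifest_path):
        """Buffer a new manifest; the manifest-list is not rewritten."""
        self.manifests.append(manifest_path)
        if len(self.manifests) >= self.max_buffered:
            self.flush()

    def flush(self):
        """Move buffered manifests into a new manifest-list."""
        self.sequence_number += 1
        self.manifest_list = self.manifest_list + [
            (path, self.sequence_number) for path in self.manifests
        ]
        self.manifests = []


table = TableState(max_buffered=3)
for name in ["m1.avro", "m2.avro", "m3.avro", "m4.avro"]:
    table.fast_append(name)
print(table.manifest_list, table.manifests)
```

With max_buffered set to 3, the first three appends end in one
consolidation and the fourth stays in the buffer.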
Constraints and Considerations
For this approach to work effectively, manifest files in the
manifests field must:
* Contain only data files that are not referenced by other
manifests
* Contain only delete files that reference data files
already present in the manifest-list
If any of these assumptions is violated, the manifest files
from the manifests field are flushed to the manifest-list and
the standard commit procedure is applied.
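A writer could check the two constraints before deciding whether to buffer
a manifest or fall back to the standard commit. The sketch below is
hypothetical: can_buffer and its arguments are names invented for the
example, and it assumes the writer already knows which data files are
referenced by other manifests and which are committed to the manifest-list.

```python
def can_buffer(new_data_files, delete_targets,
               referenced_elsewhere, committed_data_files):
    """Return True if a new manifest may stay in the manifests field.

    new_data_files: data files added by the new manifest
    delete_targets: data files referenced by its delete files
    referenced_elsewhere: data files already referenced by other manifests
    committed_data_files: data files reachable from the manifest-list
    """
    no_duplicates = not (set(new_data_files) & set(referenced_elsewhere))
    deletes_ok = set(delete_targets) <= set(committed_data_files)
    return no_duplicates and deletes_ok


committed = {"d1.parquet"}               # in the manifest-list
referenced = {"d1.parquet", "d2.parquet"}  # d2 is only in the buffer so far

# A plain append of a new data file can be buffered.
append_ok = can_buffer(["d3.parquet"], [], referenced, committed)
# A delete against a file that is still only buffered forces a flush.
delete_ok = can_buffer([], ["d2.parquet"], referenced, committed)
print(append_ok, delete_ok)
```

When the check fails, the buffered manifests would be flushed to the
manifest-list and the commit handled the usual way.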
Benefits
- Significantly reduced write amplification for streaming inserts
- Simplifies conflict resolution by the catalog. If two
concurrent writes occur, the entries in the manifests field
can simply be merged together
- Leverages existing Iceberg metadata constructs
- Maintains compatibility with current catalog operations
Note: While this proposal suggests repurposing the manifests
field, we could alternatively implement this as a new metadata
field if preferred.
I'd appreciate your thoughts on this approach and welcome any
feedback or concerns.
Best regards,
Jan