Hi all,
I'd like to propose an optimization for how we track manifest files in
Iceberg tables, specifically focusing on reducing write amplification
and simplifying conflict resolution during fast-append operations.
Background: Replace vs. Change-Based Updates
To frame this proposal, let's first consider two approaches to state
management in table systems:
1. Replace-based updates: The entire state is replaced with each update.
This is how Iceberg currently handles manifest files - when new
manifests are added, we create an entirely new snapshot.
2. Change-based updates: Only incremental changes are tracked and
replayed to derive the current state. This is similar to how Delta
tables track data files.
While Iceberg initially used purely replace-based updates, we've already
successfully adopted change-based updates for the top-level table
metadata with the REST catalog. Instead of uploading entire table
metadata, we now only upload new snapshots during update-table operations.
Proposed Enhancement
I propose extending this change-based approach to manifest file
tracking, specifically for fast-append operations. Here's how:
1. Repurpose the manifests field as a buffer to track new manifest file
additions
2. Define the complete set of manifest files as the union of:
- Manifest files from the manifest-list
- Manifest files from the manifests field
Implementation Details
- When performing fast-append operations:
* New manifest files are added to the manifests field
* Changes are committed via update-table catalog operation
* The manifest-list remains unchanged, eliminating write amplification
- After a configured number of fast-appends:
* Manifest files are removed from the manifests field
* Files are consolidated into a new manifest-list
* The manifest files are assigned the sequence-number of the snapshot
when they are written to the manifest-list
Constraints and Considerations
For this approach to work effectively, manifest files in the manifests
field must:
* Contain only data files that are not referenced by other manifests
* Contain only delete files that reference data files already
present in the manifest-list
If any of these assumptions is violated, the manifest files from the
manifests field are flushed to the manifest-list and the standard commit
procedure is applied.
Benefits
- Significantly reduced write amplification for streaming inserts
- Simplifies conflict resolution by the catalog. If two concurrent
writes occur, the entries in the manifests field can simply be merged
together
- Leverages existing Iceberg metadata constructs
- Maintains compatibility with current catalog operations
Note: While this proposal suggests repurposing the manifests field, we
could alternatively implement this as a new metadata field if preferred.
I'd appreciate your thoughts on this approach and welcome any feedback
or concerns.
Best regards,
Jan