Hi all,

I'd like to propose an optimization for how we track manifest files in Iceberg tables, specifically focusing on reducing write amplification and simplifying conflict resolution during fast-append operations.


       Background: Replace vs. Change-Based Updates

To frame this proposal, let's first consider two approaches to state management in table systems:

1. Replace-based updates: The entire state is replaced with each update. This is how Iceberg currently handles manifest files - when new manifests are added, we create an entirely new snapshot.

2. Change-based updates: Only incremental changes are tracked and replayed to derive the current state. This is similar to how Delta tables track data files.

While Iceberg initially used purely replace-based updates, we've already successfully adopted change-based updates for the top-level table metadata with the REST catalog. Instead of uploading entire table metadata, we now only upload new snapshots during update-table operations.


       Proposed Enhancement

I propose extending this change-based approach to manifest file tracking, specifically for fast-append operations. Here's how:

1. Repurpose the manifests field as a buffer to track new manifest file additions
2. Define the complete set of manifest files as the union of:
   - Manifest files from the manifest-list
   - Manifest files from the manifests field


       Implementation Details

- When performing fast-append operations:
  * New manifest files are added to the manifests field
  * Changes are committed via update-table catalog operation
  * The manifest-list remains unchanged, eliminating write amplification

- After a configured number of fast-appends:
  * Manifest files are removed from the manifests field
  * Files are consolidated into a new manifest-list
  * The manifest files are assigned the sequence-number of the snapshot when they are written to the manifest-list


       Constraints and Considerations

For this approach to work effectively, manifest files in the manifests field must:
   * Contain only data files that are not referenced by other manifests
   * Contain only delete files that reference data files already present in the manifest-list

If any of these assumptions is violated, the manifest files from the manifests field are flushed to the manifest-list and the standard commit procedure is applied.


       Benefits

- Significantly reduced write amplification for streaming inserts
- Simplifies conflict resolution by the catalog. If two concurrent writes occur, the entries in the manifests field can simply be merged together
- Leverages existing Iceberg metadata constructs
- Maintains compatibility with current catalog operations

Note: While this proposal suggests repurposing the manifests field, we could alternatively implement this as a new metadata field if preferred.

I'd appreciate your thoughts on this approach and welcome any feedback or concerns.

Best regards,

Jan

Reply via email to