Hi team,

We've been exploring ways to optimize and balance read and write
performance in merge-on-read scenarios for a while.

Below are our early ideas, and we’d appreciate community feedback to help
validate them against the Iceberg spec, especially any edge cases we might
have missed. We’re also working on a detailed design doc and prototype to
improve equality delete reads, and would welcome suggestions on
benchmarking strategies.


*Motivation for the change*

Iceberg currently supports two types of deletes for merge-on-read mode,
each with distinct advantages and trade-offs:

   1. Equality deletes
      1. Pros: Instant, global deletes written without scanning existing
      data; better support for concurrent writes.
      2. Cons: Read performance can degrade significantly over time due to
      delete file amplification.
   2. Positional deletes
      1. Pros: Minimal overhead during reads, because only matching row
      positions are skipped per file.
      2. Cons: Writers don't know row positions upfront, so full scans are
      required to generate positional deletes.

By maintaining an inverted index that maps column values to row positions,
we can address the drawbacks of both equality and positional deletes. For
instance, this index would allow us to directly convert equality conditions
into positional delete tuples, improving read performance for equality
deletes and write performance for positional deletes.

Current equality delete read flow ->

Each equality delete entry is converted into a Predicate. These predicates
are combined into a filter, and every record is evaluated sequentially
against all predicates until a match is found. This results in a time
complexity of *O(M × N)*, where *M* is the number of delete entries and *N* is
the number of data records.
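
To make the cost concrete, here is a rough sketch of that flow in Java (the
Row and Predicate types are stand-ins for the reader's internals, not the
actual implementation):

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: each of the N records is tested against up to M
// equality-delete predicates, giving the O(M x N) behavior described above.
class SequentialEqualityDeleteFilter<Row> {
  private final List<Predicate<Row>> deletePredicates; // M delete entries

  SequentialEqualityDeleteFilter(List<Predicate<Row>> deletePredicates) {
    this.deletePredicates = deletePredicates;
  }

  boolean isDeleted(Row row) {
    for (Predicate<Row> predicate : deletePredicates) {
      if (predicate.test(row)) {
        return true; // stop at the first matching delete entry
      }
    }
    return false; // row survives the merge-on-read filter
  }
}
```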

Suggested equality delete read flow with index ->

Instead of evaluating each record against all delete predicates, we use an
index to directly retrieve the row positions for the delete entries. These
lookups have a time complexity of *O(M)*. Adding *O(N)* for loading the
index and *O(M log M)* for sorting the resulting positions, we can
efficiently filter out the deleted records during the scan. The overall
time complexity is therefore *O(N + M log M)*, which is effectively *O(N)*
since *N* is typically the dominant factor.
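
As a rough illustration, assuming a per-file inverted index that maps a
column value to a bitmap of row positions (the in-memory Map shape below is
only for illustration, not a proposed format):

```java
import java.util.List;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

// Illustrative only: with an inverted index in hand, the M delete values are
// resolved to row positions with O(M) lookups, and the union of positions is
// then applied like a positional delete during the scan.
class IndexedEqualityDeleteResolver {
  private final Map<Object, RoaringBitmap> valueToPositions; // inverted index for one column

  IndexedEqualityDeleteResolver(Map<Object, RoaringBitmap> valueToPositions) {
    this.valueToPositions = valueToPositions;
  }

  RoaringBitmap deletedPositions(List<Object> deletedValues) {
    RoaringBitmap deleted = new RoaringBitmap();
    for (Object value : deletedValues) {          // O(M) index lookups
      RoaringBitmap positions = valueToPositions.get(value);
      if (positions != null) {
        deleted.or(positions);                    // union of matching row positions
      }
    }
    return deleted;                               // position-ordered; applied as positional deletes
  }
}
```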

This approach would work. That said, since deletes can target any column, we
should consider different indexing strategies:

   - *Low- to medium-cardinality columns*: Bitmap indexes work best due to
   their space efficiency.
   - *High-cardinality columns*: Bitmap indexes become space-inefficient;
   tree-based or hash indexes are more suitable.
   - *Upsert cases* (where all columns are included in the equality delete
   entry): Encoding all columns with a hash-based index is more effective
   (see the sketch after this list).
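
For the upsert case specifically, a minimal sketch of the hash-based idea
(illustrative only; a real index would store a compact key encoding rather
than boxed values):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: when every equality column is part of the delete entry,
// the whole row key can be used as the hash key, so one probe maps a delete
// entry to the row position it should remove.
class HashKeyIndex {
  private final Map<List<Object>, Long> keyToPosition = new HashMap<>();

  void add(List<Object> equalityColumnValues, long rowPosition) {
    keyToPosition.put(List.copyOf(equalityColumnValues), rowPosition);
  }

  Long lookup(List<Object> equalityColumnValues) {
    return keyToPosition.get(equalityColumnValues); // null if the key is absent from this file
  }
}
```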

*Implementation proposal*

*How to map and reference index files*

Index files are built per data file and per column. Since data files are
immutable, the index only needs to be generated once and its lifecycle is
tied to the file. Indexing is done per column to support both single-column
and multi-column indexes. Since a data file can have multiple associated
index files, referencing index files directly from the data file is more
efficient than using a separate index manifest (as is done with delete
files). This approach avoids inflating the manifest list and helps maintain
normal write performance. This matters because, unlike delete files, which
can be compacted and are typically proportional to the data files, index
files persist and grow alongside the data files.

*How to build index and manage index files*

We will provide index builders to create indexes based on the metadata
outlined below and register the index file with the corresponding data
file. This process can be executed asynchronously with applicable write
operations, and should not impact the normal write performance.

IndexFile Metadata

   - *index_file_uri*: URI or path to the index file
   - *index_file_type*: Type of index file (e.g., Parquet, custom format)
   - *index_type*: Type of index (e.g., bitmap, hash)
   - *referenced_data_file*: The data file this index is built for
   - *referenced_column_ids*: IDs of the columns the index covers
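
To make the shape concrete, here is the same metadata expressed as a Java
record (field names mirror the list above and are not a proposed API):

```java
import java.util.List;

// Illustrative shape of the proposed index file metadata.
record IndexFileMetadata(
    String indexFileUri,               // URI or path to the index file
    String indexFileType,              // e.g. Parquet, Puffin, or a custom format
    String indexType,                  // e.g. bitmap, hash
    String referencedDataFile,         // the data file this index is built for
    List<Integer> referencedColumnIds  // IDs of the columns the index covers
) {}
```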

The current orphan file removal process will be updated to also remove
index files. Additionally, we could provide a way for users to list all
their index files, allowing them to choose and delete specific index
files/indexes if needed.

For indexes to keep working after type promotion, we should consider
building them against the promoted types directly.

*How to store index files*

While the community previously aligned on using Puffin files to store index
data, index files are typically larger and more varied than stats files, so
Puffin may not be the best fit for all cases. It would be valuable to
benchmark alternative formats for performance, and we're open to other
suggestions as well.

*How to use index files*

Since index files are maintained at the file level, they will be utilized
during file scan tasks. The process will first consult the index in order to:

   - Skip the scan entirely if no records match.
   - Skip predicate evaluation and use delete positions directly when
   reading equality deletes.
   - Skip the scan and return delete positions directly when writing
   positional deletes.
   - Provide values directly in the case of a covering index.
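
A rough sketch of how a file scan task might consult the index before
reading any rows (the IndexReader interface and its methods are
hypothetical, not an existing Iceberg API):

```java
import org.roaringbitmap.RoaringBitmap;

// Illustrative only: the scan task consults the per-file index first and
// falls back to a normal scan when no index is available.
class IndexedFileScanTask {
  interface IndexReader {
    boolean anyMatch(Object value);            // can this file contain the value at all?
    RoaringBitmap positionsFor(Object value);  // value -> matching row positions
  }

  enum Decision { SKIP_FILE, USE_POSITIONS, FULL_SCAN }

  static Decision plan(IndexReader index, Object value) {
    if (index == null) {
      return Decision.FULL_SCAN;               // no index for this file: existing behavior
    }
    if (!index.anyMatch(value)) {
      return Decision.SKIP_FILE;               // skip the scan entirely if no records match
    }
    return Decision.USE_POSITIONS;             // use positions from the index, no per-row predicates
  }
}
```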

*Breaking changes/incompatibilities*

Existing tables without index files should continue to work as-is. Queries
that use indexes should not introduce regressions.

While the original goal is to improve merge-on-read performance, we aim to
build a pluggable API for creating and querying indexes. Since indexes can
support a wide range of queries and use cases, the design should be
flexible and extensible. Another benefit of making this pluggable is that
it can work with any new data types Iceberg will support, as long as they
can be indexed.
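
As a starting point for discussion, the pluggable API could look roughly
like this (names and shapes are placeholders, not a concrete proposal):

```java
import java.io.Closeable;
import org.roaringbitmap.RoaringBitmap;

// Illustrative only: one builder/reader pair per index type (bitmap, hash,
// tree, ...), so new index types and new data types can be plugged in
// without changing the read or write paths.
interface IndexPlugin {
  String indexType();                            // e.g. "bitmap", "hash"

  IndexBuilder newBuilder(int[] columnIds);      // build an index for one data file

  IndexReader openReader(String indexFileUri);   // query an existing index file

  interface IndexBuilder {
    void add(Object[] columnValues, long rowPosition); // called once per row
    void writeTo(String indexFileUri);                 // persist the finished index
  }

  interface IndexReader extends Closeable {
    RoaringBitmap positionsFor(Object[] columnValues); // values -> matching row positions
  }
}
```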

Thanks for reviewing the doc [1]; we'd love your feedback!


Best,

Xiaoxuan Li


[1] - https://github.com/apache/iceberg/issues/13000
