Hi Xiaoxuan,

Thanks for the proposal. Equality deletes were designed initially for fast upserts, so that an upstream CDC stream can be written into Iceberg directly with relatively good freshness. I agree that if we are talking about the best performance, it is only partially implemented today, and there is still some work we need to do to make it faster.
The thing I am most curious about is: what are the typical business or scenario cases you are trying to address? I can probably give more useful feedback once we are talking about the real cases and how the design fits them.

Best, Zheng!

On Wed, May 7, 2025 at 3:25 PM Xiaoxuan Li <xiaoxuan.li....@gmail.com> wrote:
> Hi team,
>
> We've been exploring ways to optimize and balance read and write performance in merge-on-read scenarios for a while.
>
> Below are our early ideas, and we'd appreciate community feedback to help validate them against the Iceberg spec, especially any edge cases we might have missed. We're also working on a detailed design doc and prototype to improve equality delete reads, and would appreciate suggestions for benchmarking strategies.
>
> *Motivation for the change*
>
> Iceberg currently supports two types of deletes for merge-on-read mode, each with distinct advantages and trade-offs:
>
> 1. Equality deletes
>    1. Pros: instant, global deletions with no read overhead at write time, and better support for concurrent writes.
>    2. Cons: read performance can degrade significantly over time due to delete file amplification.
> 2. Positional deletes
>    1. Pros: minimal overhead during reads, because only the matching row positions are skipped per file.
>    2. Cons: writers don't know row positions upfront, so generating positional deletes requires full scans.
>
> By maintaining an inverted index that maps column values to row positions, we can address the drawbacks of both equality and positional deletes. For instance, this index would allow us to convert equality conditions directly into positional delete tuples, improving read performance for equality deletes and write performance for positional deletes.
>
> Current equality delete read flow ->
>
> Each equality delete entry is converted into a Predicate. These predicates are combined into a filter, and every record is evaluated sequentially against all predicates until a match is found. This results in a time complexity of *O(M × N)*, where *M* is the number of delete entries and *N* is the number of data records.
>
> Suggested equality delete read flow with index ->
>
> Instead of evaluating each record against all delete predicates, we use an index to retrieve the row positions for the delete entries directly. This lookup has a time complexity of *O(M)*. If we count index loading time as *O(N)*, we can then sort the positions (an *O(M log M)* step) and efficiently filter out the deleted records during the scan. As a result, the overall time complexity is reduced to roughly *O(N)*, with N being the dominant factor.
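>
> As a concrete illustration of the suggested flow, here is a minimal Java sketch of the index-based lookup. The in-memory map stands in for whatever on-disk index format is ultimately chosen, and none of the names below are existing Iceberg APIs:
>
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import java.util.TreeSet;
>
> // Sketch only: a per-data-file inverted index (column value -> row positions)
> // turns equality delete entries into positional deletes without evaluating
> // delete predicates against every record.
> public class EqualityDeleteIndexSketch {
>
>   public static void main(String[] args) {
>     // Hypothetical inverted index on column "id" for a single data file,
>     // built once when the (immutable) data file is written.
>     Map<Long, List<Long>> idIndex = new HashMap<>();
>     idIndex.put(101L, List.of(0L));
>     idIndex.put(102L, List.of(1L, 7L));
>     idIndex.put(103L, List.of(2L));
>
>     // Equality delete entries targeting column "id".
>     List<Long> equalityDeletes = List.of(102L, 999L);
>
>     // O(M) lookups instead of O(M x N) predicate evaluation; the sorted
>     // positions can be applied as a positional filter during the scan, or
>     // persisted directly as positional delete tuples (path, position).
>     TreeSet<Long> deletedPositions = new TreeSet<>();
>     for (Long deletedId : equalityDeletes) {
>       deletedPositions.addAll(idIndex.getOrDefault(deletedId, List.of()));
>     }
>
>     System.out.println("positions to delete: " + deletedPositions); // [1, 7]
>   }
> }
>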
> This approach would work. That said, since deletes can target any column, we should consider different indexing strategies:
>
> - *Low to medium cardinality columns*: Bitmap indexes work best due to their space efficiency.
> - *High cardinality columns*: Bitmap indexes become space-inefficient; tree-based or hash indexes are more suitable.
> - *Upsert cases* (where all columns are included in the equality delete entry): Encoding all columns with a hash-based index is more effective.
>
> *Implementation proposal*
>
> *How to map and reference index files*
>
> Index files are built per data file and per column. Since data files are immutable, an index only needs to be generated once and its lifecycle is tied to the data file. Indexing is done per column to support both single-column and multi-column indexes.
>
> Since a data file can have multiple associated index files, referencing index files directly from the data file is more efficient than using a separate index manifest (as is done for delete files). This approach avoids inflating the manifest list and helps maintain normal write performance. Unlike delete files, which can be compacted and are typically proportional to the data files, index files persist and grow alongside the data files.
>
> *How to build indexes and manage index files*
>
> We will provide index builders that create indexes based on the metadata outlined below and register each index file with the corresponding data file. This process can run asynchronously alongside the applicable write operations and should not impact normal write performance.
>
> IndexFile Metadata
>
> - *index_file_uri*: URI or path to the index file
> - *index_file_type*: Type of index file (e.g., Parquet, custom format)
> - *index_type*: Type of index (e.g., bitmap, hash)
> - *referenced_data_file*: The data file this index is built for
> - *referenced_column_ids*: IDs of the columns the index covers
>
> The current orphan file removal process will be updated to also remove index files. Additionally, we could provide a way for users to list all of their index files, so they can choose and delete specific index files/indexes if needed.
>
> For indexes to keep working after type promotion, we should consider building indexes against the promoted types directly.
>
> *How to store index files*
>
> While the community previously aligned on using Puffin files to store index data, index files are typically larger and more varied than stats, so Puffin may not be the best fit for all cases. It would be valuable to benchmark alternative formats for performance; we are open to other suggestions as well.
>
> *How to use index files*
>
> Since index files are maintained at the file level, they will be used during file scan tasks. The process will first consult the index to either:
>
> - Skip the scan entirely if no records match
> - Skip predicate evaluation and use delete positions directly when reading equality deletes
> - Skip the scan and return delete positions directly when writing positional deletes
> - Provide values directly in the case of a covering index
>
> *Breaking changes/incompatibilities*
>
> Existing tables without index files should continue to work as-is, and queries that use indexes should not introduce regressions.
>
> While the original goal is to improve merge-on-read performance, we aim to build a pluggable API for creating and querying indexes. Since indexes can support a wide range of queries and use cases, the design should be flexible and extensible. Another benefit of making it pluggable is that it can work with any new data types Iceberg supports, as long as they can be indexed.
>
> Thanks for reviewing the doc [1]; we'd love feedback!
>
> Best,
>
> Xiaoxuan Li
>
> [1] - https://github.com/apache/iceberg/issues/13000
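>
> P.S. To make the proposed metadata concrete, here is a rough Java sketch of how an index file reference and a pluggable builder hook could be modeled. All names and types are illustrative only; nothing here is a committed schema or an existing Iceberg API.
>
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
>
> // Mirrors the IndexFile metadata fields listed above (illustrative only).
> record IndexFile(
>     String indexFileUri,               // index_file_uri
>     String indexFileType,              // index_file_type, e.g. "puffin", "parquet"
>     String indexType,                  // index_type, e.g. "bitmap", "hash"
>     String referencedDataFile,         // referenced_data_file
>     List<Integer> referencedColumnIds  // referenced_column_ids
> ) {}
>
> // Hypothetical pluggable builder: builds indexes for an immutable data file
> // and registers the resulting index file with that data file, possibly
> // asynchronously so the normal write path is not affected.
> interface IndexBuilder {
>   CompletableFuture<IndexFile> buildAndRegister(String dataFilePath, List<Integer> columnIds);
> }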