Hi Edgar,

Thanks for the well-described proposal!
Knowing the Flink connector, I have the following concerns:

- The Flink connector currently doesn't sort the rows in the data files. It deliberately "chickens out" of this to avoid keeping anything in memory.
- Sorting the equality delete rows would also add memory pressure to the writers.
- Bucketing is a good solution when the number of rows/changes is stable, but changes in load could require rebucketing (repartitioning). This would cause issues with the scope of the equality delete files.

These are my concerns, but that doesn't mean we should not consider this proposal, if the community would like to keep the equality delete files and we don't deprecate them.

Thanks,
Peter

On Tue, Apr 1, 2025, 11:53 Edgar Rodriguez <edgar.rodrig...@airbnb.com.invalid> wrote:

> Hi all,
>
> I know there have been some conversations regarding optimization of equality deletes and even their possible deprecation. We have been thinking internally about a way to optimize merge-on-read with equality deletes, to better balance read performance while keeping the benefit of performant writes that comes from encoding deletes as equality deletes in streaming engines.
>
> The following are our high-level thoughts, which we'd like to share with the community to see if we are thinking about this the right way in terms of the Iceberg spec, or if there are any obvious issues with the approach that we've missed. We'd appreciate the feedback, and if this sounds reasonable at a high level, we could follow up on the presented ideas with a more detailed design doc to review.
>
> *Motivation*
> Iceberg currently supports two types of deletes for merge-on-read:
>
> 1. Positional deletes, stored as sorted tuples: (file name, position).
> 2. Equality deletes, stored in roughly the same schema as the data files, where a value in a column indicates an equality added to the conjunction.
>
> Since positional deletes are sorted the same way the data is (i.e.
by file position), a simple sorted merge can be applied, which is very efficient for the reader. Equality deletes have to be joined in some way, either by shuffling the data or by broadcasting the equality deletes.
>
> Writers rarely know the positions of the rows that need to be deleted; instead, the data to be deleted is identified at write time (impacting write performance) via some unique key. As a result, to balance read/write performance, generally the following two options are considered:
>
> 1. Encode deletions using equality deletes. Periodically run a maintenance job that rewrites equality deletes into positional deletes to maintain acceptable read-side performance.
> 2. Maintain an index of all the data in data files and write positional deletes directly by converting the unique key into a positional delete tuple.
>
> Both of these implementations have significant downsides:
>
> - In the former case, read-side performance depends on the number of modifications applied to the table and the frequency of the maintenance job. A sudden change in table workload (e.g. a backfill modifying a large number of rows) can affect read performance significantly. Additionally, the maintenance job performs computation proportional to the total number of rows in the table, which is larger than the mutated set and tends to grow over time.
> - In the latter case, the writer has to maintain the index, which for large tables can be measured in TBs and can lead to long checkpoints and restarts.
>
> One observation is that the significant speed-up on the read side comes from the fact that positional deletes and data files are sorted in the same way. This proposal describes sorted equality deletes with bucketing that provide a middle ground between the two encodings that is both reader- and writer-friendly.
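To make the read-side argument concrete, the sorted-merge scan the proposal relies on can be sketched as follows. This is an illustrative Python sketch under assumed in-memory row/key layouts, not Iceberg or Flink code; the point is that when data rows and delete keys share a sort order, one co-sequential pass replaces a hash join or broadcast:

```python
# Sketch: apply sorted equality deletes to a data file whose rows are
# sorted on the same key. Both inputs are consumed once in order, so the
# merge is O(n + m) with no join, shuffle, or broadcast.
def apply_sorted_equality_deletes(rows, delete_keys, key):
    """Yield rows whose key does not appear in the sorted delete keys."""
    it = iter(delete_keys)
    d = next(it, None)
    for row in rows:
        k = key(row)
        # Advance the delete cursor past keys smaller than the current row.
        while d is not None and d < k:
            d = next(it, None)
        if d is None or d != k:
            yield row

# Hypothetical example: rows sorted by a unique id, deletes sorted the same way.
rows = [{"id": 1}, {"id": 3}, {"id": 5}, {"id": 7}]
deletes = [3, 7]
print(list(apply_sorted_equality_deletes(rows, deletes, key=lambda r: r["id"])))
# [{'id': 1}, {'id': 5}]
```

An unsorted equality delete file would force the reader to build a hash set of all delete keys (or join against them) instead of streaming them, which is exactly the cost the sort-enforced mode below is meant to avoid.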
>
> *Requirements*
>
> - All existing jobs and Iceberg tables should continue to work without any effect, unless the optimization is explicitly enabled.
> - Only applicable to tables that specify a sort order.
>
> *Proposal: Sorted Equality Deletes*
> The current Iceberg spec provides a way to encode sorting <https://iceberg.apache.org/spec/#sorting> applied to data files and/or delete files; however, that sort order is currently not enforced for writers. Violation of the sort order may affect performance, but not correctness. The proposal here is the following:
>
> 1. Introduce a configuration option (e.g. sort-enforced) that allows taking the sorted nature of data files and equality delete files into account during query planning; this means that if the new setting is set but a compute engine is not enforcing the sort, the write will fail. There would also be a need to tag data files as "data-sorted" to ensure that all data files and delete files involved in a scan are sorted as well.
> 2. During data scans, the following optimization is added: if a data file and all delete files that can affect it are sorted on the same columns, a sorted merge scan is used instead of joins.
>
> *Performance*
> Compared to positional deletes, sorted equality deletes do NOT automatically prune the data files affected by the deletes (this is by design); therefore, we should generally expect positional deletes to perform better. However, this effect can be mitigated by a linear factor using a partition spec with a bucketing transformation. For example, partitioning the data by the unique key into 16 buckets and producing 16 sorted equality delete files with associated partition values reduces the amount of unrelated deletes that have to be scanned by a factor of 16.
>
> *Compatibility*
> The optimization is enabled by the sort-enforced configuration setting and is turned off by default.
Therefore, any existing jobs should continue to work unaffected.
>
> Thanks for taking a look.
>
> Best,
>
> --
> Edgar R
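[Editor's note: the linear mitigation factor described under *Performance* can be illustrated with a short sketch. This is assumed, illustrative Python: the key names and bucket count are made up, and crc32 stands in for the 32-bit Murmur3 hash that Iceberg's real bucket transform uses, purely to keep the example deterministic and self-contained.]

```python
import zlib

NUM_BUCKETS = 16

def bucket(key: str, n: int = NUM_BUCKETS) -> int:
    # Stand-in for Iceberg's bucket transform (which uses Murmur3-x86-32).
    return zlib.crc32(key.encode("utf-8")) % n

# Route each equality-delete key to the delete file for its bucket, so each
# delete file carries a partition value and only covers 1/16 of the keyspace.
delete_files = {b: [] for b in range(NUM_BUCKETS)}
for key in (f"user-{i}" for i in range(1000)):
    delete_files[bucket(key)].append(key)

# A scan of one data bucket only reads that bucket's (sorted) delete file:
# roughly 1000 / 16 ≈ 62 deletes instead of all 1000, i.e. the unrelated
# deletes scanned shrink by about the bucket count.
largest = max(len(f) for f in delete_files.values())
print(sum(len(f) for f in delete_files.values()), largest)
```

Within each bucket the delete keys would additionally be kept in sort order, so the sorted merge scan from the proposal applies per bucket.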