CMIW, the spec does not enforce `identifier fields` for equality delete files. Engines are free to use different `equality_ids` across commits, though such cases should be rare. Similarly, which sort order should we use? It is common for a table to set its sort order on columns other than the primary keys, and the sorted-merge approach becomes more complicated if `equality_ids` vary across commits.
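To make this concrete, here is a minimal sketch (hypothetical types, not the actual Iceberg API) of the check a planner would need before choosing a sorted merge: every equality delete file in the scan must agree on both the equality-id set and the sort order, and otherwise the reader has to fall back to a join.

```java
import java.util.List;

// Hypothetical stand-in for the per-file metadata a planner would inspect.
class DeleteFileMeta {
    final List<Integer> equalityIds;  // field ids the delete file keys on
    final int sortOrderId;            // sort order the file was written with

    DeleteFileMeta(List<Integer> equalityIds, int sortOrderId) {
        this.equalityIds = equalityIds;
        this.sortOrderId = sortOrderId;
    }
}

class MergePlanner {
    /**
     * Returns true only if all delete files share one equality-id set and
     * one sort order; otherwise a sorted merge is not valid and the reader
     * must fall back to a join.
     */
    static boolean canUseSortedMerge(List<DeleteFileMeta> deletes) {
        if (deletes.isEmpty()) {
            return true;
        }
        DeleteFileMeta first = deletes.get(0);
        for (DeleteFileMeta d : deletes) {
            if (!d.equalityIds.equals(first.equalityIds)
                || d.sortOrderId != first.sortOrderId) {
                return false;  // equality_ids or sort order varied between commits
            }
        }
        return true;
    }
}
```

If `equality_ids` did vary across commits, one could imagine grouping the delete files by their id set and merging group by group, but that costs one pass per group and complicates planning further.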
Best,
Gang

On Wed, Apr 2, 2025 at 6:48 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Edgar,
>
> Thanks for the well-described proposal!
>
> Knowing the Flink connector, I have the following concerns:
> - The Flink connector currently doesn't sort the rows in the data
> files. It "chickens" out of this to avoid keeping anything in memory.
> - Sorting the equality delete rows would also add memory pressure to
> the writers.
> - Bucketing is a good solution when the number of rows/changes is
> stable, but changes in load could require rebucketing
> (repartitioning). This would cause issues with the scope of the
> equality delete files.
>
> I have collected my concerns, but this doesn't mean that we should not
> consider this proposal, provided the community would like to keep the
> equality delete files and we don't deprecate them.
>
> Thanks,
> Peter
>
> On Tue, Apr 1, 2025, 11:53 Edgar Rodriguez
> <edgar.rodrig...@airbnb.com.invalid> wrote:
>
>> Hi all,
>>
>> I know there have been some conversations regarding the optimization
>> of equality deletes and even their possible deprecation. We have been
>> thinking internally about a way to optimize merge-on-read with
>> equality deletes to better balance read performance while retaining
>> the benefits of performant writes when encoding equality deletes in
>> streaming engines.
>>
>> The following are our high-level thoughts, which we'd like to share
>> with the community to see whether we are thinking about this the
>> right way in terms of the Iceberg spec, or whether there are any
>> obvious issues with the approach that we've missed. We'd appreciate
>> the feedback, and if at a high level this sounds reasonable, we could
>> follow up the ideas presented here with a more detailed design doc to
>> review.
>>
>> *Motivation*
>> Iceberg currently supports two types of deletes for merge-on-read:
>>
>> 1. Positional deletes, stored as sorted tuples: (file name, position).
>> 2. Equality deletes, stored in roughly the same schema as the data
>> files, where a value in a column indicates an equality added to the
>> conjunction.
>>
>> Since positional deletes are sorted the same way the data is (i.e. by
>> file position), a simple sorted merge can be applied, which is very
>> efficient for the reader. Equality deletes have to be joined in some
>> way, either by shuffling the data or by broadcasting the equality
>> deletes.
>>
>> Writers rarely know the positions of the rows that need to be
>> deleted; instead, the data to be deleted is identified at write time
>> (impacting write performance) via some unique key. As a result, to
>> balance read and write performance, the following two options are
>> generally considered:
>>
>> 1. Encode deletions using equality deletes. Periodically run a
>> maintenance job that rewrites equality deletes into positional
>> deletes to maintain acceptable read-side performance.
>> 2. Maintain an index of all the data in data files and write
>> positional deletes directly by converting the unique key into a
>> positional delete tuple.
>>
>> Both of these implementations have significant downsides:
>>
>> - In the former case, read-side performance depends on the number of
>> modifications applied to the table and on how frequently the
>> maintenance job runs. A sudden change in table workload (e.g. a
>> backfill modifying a large number of rows) can affect read
>> performance significantly.
>> Additionally, the maintenance job performs computation proportional
>> to the total number of rows in the table, which is larger than the
>> mutated set and tends to grow over time.
>> - In the latter case, the writer has to maintain the index, which for
>> large tables can be measured in TBs and can lead to long checkpoints
>> and restarts.
>>
>> One observation is that the significant speed-up on the read side
>> comes from the fact that positional deletes and data files are sorted
>> in the same way. This proposal describes sorted equality deletes with
>> bucketing, which provide a middle ground between the two encodings
>> that is both reader- and writer-friendly.
>>
>> *Requirements*
>>
>> - All existing jobs and Iceberg tables should continue to work
>> without any effect unless the optimization is explicitly enabled.
>> - Only applicable to tables that specify a sort order.
>>
>> *Proposal: Sorted Equality Deletes*
>> The current Iceberg spec provides a way to encode sorting
>> <https://iceberg.apache.org/spec/#sorting> applied to data files
>> and/or delete files; however, that sort order is currently not
>> enforced for writers. Violating the sort order may affect
>> performance, but not correctness. The proposal here is the following:
>>
>> 1. Introduce a configuration option (e.g. sort-enforced) that allows
>> the sorted nature of data files and equality delete files to be taken
>> into account during query planning. This means that if the new
>> setting is set but a compute engine does not enforce the sort, the
>> write will fail. There would also be a need to tag data files as
>> "data-sorted" to ensure that all data files and delete files involved
>> in a scan are sorted as well.
>> 2. During data scans, the following optimization is added: if a data
>> file and all delete files that can affect it are sorted on the same
>> columns, a sorted merge scan is used instead of joins (a sketch of
>> such a merge follows at the end of this thread).
>>
>> *Performance*
>> Compared to positional deletes, sorted equality deletes DO NOT
>> automatically prune the data files affected by the deletes (this is
>> by design); therefore, we should generally expect positional deletes
>> to perform better. However, this effect can be mitigated by a linear
>> factor using a partition spec with a bucketing transformation. For
>> example, partitioning the data by the unique key into 16 buckets and
>> producing 16 sorted equality delete files with associated partition
>> values reduces the number of unrelated deletes that have to be
>> scanned by a factor of 16.
>>
>> *Compatibility*
>> The optimization is enabled by the sort-enforced configuration
>> setting and is turned off by default. Therefore, any existing jobs
>> should continue to work unaffected.
>>
>> Thanks for taking a look.
>>
>> Best,
>>
>> --
>> Edgar R
>
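For reference, here is a minimal sketch of the sorted merge scan the proposal describes. It assumes, for illustration only, that the unique key is a single long stored in column 0 and that both inputs are sorted ascending on that key; sequence-number visibility (equality deletes applying only to rows written earlier) is ignored. The types and names are illustrative, not Iceberg APIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class SortedMergeScan {
    /**
     * Emits rows from `data` whose key does not appear in `deleteKeys`.
     * Both iterators must be sorted ascending on the same key; rows are
     * long[] with the key assumed to be in column 0.
     */
    static List<long[]> apply(Iterator<long[]> data, Iterator<Long> deleteKeys) {
        List<long[]> live = new ArrayList<>();
        Long del = deleteKeys.hasNext() ? deleteKeys.next() : null;
        while (data.hasNext()) {
            long[] row = data.next();
            long key = row[0];
            // Advance past delete keys smaller than the current row's key.
            while (del != null && del < key) {
                del = deleteKeys.hasNext() ? deleteKeys.next() : null;
            }
            // Keep the row only if no delete key matches it.
            if (del == null || del != key) {
                live.add(row);
            }
        }
        return live;
    }
}
```

With data files and per-bucket delete files sorted the same way, applying all deletes for a bucket is a single forward pass rather than a hash join, and bucketing by the key (the proposal's 16-bucket example) bounds how many unrelated delete keys each pass has to skip over.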