CMIW, the spec does not enforce `identifier fields` for equality
delete files. Engines are free to use different `equality_ids` among
commits, though that use case should be rare. Similarly, what sort order
should we use? It is common for a table to set its sort order on columns
other than the primary keys. Things would get more complicated if
`equality_ids` varied among commits.

Best,
Gang

On Wed, Apr 2, 2025 at 6:48 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Edgar,
>
> Thanks for the well described proposal!
>
> Knowing the Flink connector, I have the following concerns:
> - The Flink connector currently doesn't sort the rows in the data files.
> It deliberately "chickens out" of this to avoid keeping anything in
> memory.
> - Sorting the equality delete rows would also add memory pressure to the
> writers.
> - Bucketing is a good solution when the number of rows/changes is stable,
> but changes in load could require rebucketing (repartitioning). This
> would cause issues with the scope of the equality delete files.
>
> I have collected my concerns, but this doesn't mean that we shouldn't
> consider this proposal if the community would like to keep equality
> delete files rather than deprecate them.
>
> Thanks, Peter
>
>
>
>
> On Tue, Apr 1, 2025, 11:53 Edgar Rodriguez
> <edgar.rodrig...@airbnb.com.invalid> wrote:
>
>> Hi all,
>>
>> I know there have been some conversations regarding optimization of
>> equality deletes and even their possible deprecation. We have been
>> thinking internally about a way to optimize merge-on-read with equality
>> deletes that better balances read performance while keeping the
>> write-performance benefits of encoding deletes as equality deletes in
>> streaming engines.
>>
>> The following are our high-level thoughts, which we'd like to share with
>> the community to see whether we are thinking about this the right way in
>> terms of the Iceberg spec, or whether there are any obvious issues with
>> the approach that we've missed. We'd appreciate the feedback, and if this
>> sounds reasonable at a high level, we could follow up with a more
>> detailed design doc to review.
>>
>> *Motivation*
>> Iceberg currently supports two types of deletes for merge-on-read:
>>
>>    1. Positional deletes stored as sorted tuples: (file name, position).
>>    2. Equality deletes, stored in roughly the same schema as the data
>>    files, where a value in a column adds an equality predicate to the
>>    conjunction.
>>
>> Since positional deletes are sorted the same way the data is (i.e. by
>> file position), a simple sorted merge can be applied, which is very
>> efficient for the reader. Equality deletes have to be joined in some way,
>> either by shuffling the data or by broadcasting the equality deletes.
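>>
>> To make the contrast concrete, here is a minimal, self-contained sketch
>> (plain Java, not Iceberg's actual reader code) of why positional deletes
>> are cheap to apply: since delete positions are sorted, a single forward
>> pass over the data file suffices.
>>
>> import java.util.Iterator;
>> import java.util.List;
>>
>> class PositionalDeleteMerge {
>>     /** Emits every row position in [0, rowCount) that is not deleted. */
>>     static void scan(long rowCount, Iterator<Long> sortedDeletePositions) {
>>         long nextDelete =
>>             sortedDeletePositions.hasNext() ? sortedDeletePositions.next() : -1;
>>         for (long pos = 0; pos < rowCount; pos++) {
>>             if (pos == nextDelete) {
>>                 // Skip the deleted row and advance the delete cursor.
>>                 nextDelete =
>>                     sortedDeletePositions.hasNext() ? sortedDeletePositions.next() : -1;
>>                 continue;
>>             }
>>             System.out.println("emit row at position " + pos);
>>         }
>>     }
>>
>>     public static void main(String[] args) {
>>         scan(6, List.of(1L, 4L).iterator());  // emits positions 0, 2, 3, 5
>>     }
>> }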
>>
>> Writers rarely know the positions of the rows that need to be deleted;
>> instead, the data to be deleted is identified at write time (impacting
>> write performance) via some unique key. As a result, to balance read and
>> write performance, one of the following two options is generally
>> considered:
>>
>>    1. Encode deletions using equality deletes. Periodically run a
>>    maintenance job that rewrites equality deletes into positional deletes to
>>    maintain acceptable read-side performance.
>>    2. Maintain an index of all the data in data files and write
>>    positional deletes directly by converting the unique key into a
>>    positional delete tuple (see the sketch after this list).
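>>
>> A minimal sketch of option 2, with hypothetical names: the writer keeps a
>> key -> (file, position) index and converts each delete-by-key into a
>> positional-delete tuple. For large tables this map is the TB-scale state
>> mentioned below.
>>
>> import java.util.HashMap;
>> import java.util.Map;
>>
>> class PositionIndexSketch {
>>     record FilePosition(String fileName, long position) {}
>>
>>     private final Map<Long, FilePosition> index = new HashMap<>();
>>
>>     // Record where each written row landed, keyed by its unique key.
>>     void onRowWritten(long key, String fileName, long position) {
>>         index.put(key, new FilePosition(fileName, position));
>>     }
>>
>>     /** Converts a delete-by-key into a positional delete, if the row is known. */
>>     FilePosition toPositionalDelete(long key) {
>>         return index.get(key);
>>     }
>> }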
>>
>>
>> Both of these implementations have significant downsides:
>>
>>    - In the former case, read-side performance depends on the number of
>>    modifications applied to the table and on how frequently the
>>    maintenance job runs. A sudden change in table workload (e.g. a
>>    backfill modifying a large number of rows) can affect read performance
>>    significantly. Additionally, the maintenance job performs computation
>>    proportional to the total number of rows in the table, which is larger
>>    than the mutated set and tends to grow over time.
>>    - In the latter case, the writer has to maintain the index, which for
>>    large tables can be measured in TBs and can lead to long checkpoints and
>>    restarts.
>>
>> One observation is that the significant speed-up on the read side comes
>> from the fact that positional deletes and data files are sorted in the
>> same way. This proposal describes sorted equality deletes with bucketing,
>> which provide a middle ground between the two encodings that is both
>> reader- and writer-friendly.
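>>
>> As an illustration of the read path this enables, here is a hedged sketch
>> in plain Java: when data rows and equality-delete keys are sorted on the
>> same key, a single co-iteration replaces the shuffle or broadcast join. A
>> single long key stands in for the equality_ids columns, and
>> sequence-number checks (a delete applies only to rows written earlier)
>> are omitted for brevity.
>>
>> import java.util.Iterator;
>> import java.util.List;
>>
>> class SortedEqualityDeleteMerge {
>>     /** Emits every data key that has no match in the sorted delete stream. */
>>     static void scan(Iterator<Long> dataKeys, Iterator<Long> deleteKeys) {
>>         Long delete = deleteKeys.hasNext() ? deleteKeys.next() : null;
>>         while (dataKeys.hasNext()) {
>>             long key = dataKeys.next();
>>             // Advance the delete cursor past keys smaller than the current row.
>>             while (delete != null && delete < key) {
>>                 delete = deleteKeys.hasNext() ? deleteKeys.next() : null;
>>             }
>>             if (delete != null && delete == key) {
>>                 continue;  // deleted; keep the cursor in case the key repeats
>>             }
>>             System.out.println("emit row with key " + key);
>>         }
>>     }
>>
>>     public static void main(String[] args) {
>>         scan(List.of(1L, 2L, 3L, 5L, 8L).iterator(),
>>              List.of(2L, 5L, 9L).iterator());  // emits 1, 3, 8
>>     }
>> }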
>>
>> *Requirements*
>>
>>    - All existing jobs and Iceberg tables should continue to work
>>    unaffected unless the optimization is explicitly enabled.
>>    - Only applicable to tables that specify a sort order.
>>
>>
>> *Proposal: Sorted Equality Deletes*
>> The current Iceberg spec provides a way to encode sorting
>> <https://iceberg.apache.org/spec/#sorting> applied to data files and/or
>> delete files; however, that sort order is currently not enforced on
>> writers. Violating the sort order may affect performance, but not
>> correctness. The proposal here is the following:
>>
>>    1. Introduce a configuration option (e.g. sort-enforced) that allows
>>    query planning to take into account the sorted nature of data files and
>>    equality delete files. This means that if the new setting is enabled
>>    but a compute engine is not enforcing the sort, the write will fail.
>>    There would also be a need to tag data files as "data-sorted" to ensure
>>    that all data files and delete files involved in a scan are sorted as
>>    well.
>>    2. During data scans, the following optimization is added: if a data
>>    file and all delete files that can affect it are sorted on the same
>>    columns, a sorted merge scan is used instead of a join (see the planner
>>    sketch after this list).
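>>
>> A minimal sketch of what that planning check could look like. This is
>> hypothetical Java, not an existing Iceberg API; the names (FileMeta,
>> taggedSorted, canUseSortedMergeScan) are illustrative only.
>>
>> import java.util.List;
>> import java.util.Objects;
>>
>> class ScanPlanSketch {
>>     record FileMeta(List<Integer> sortColumnIds, boolean taggedSorted) {}
>>
>>     /**
>>      * The sorted-merge scan can replace the join only when the feature is
>>      * on, the data file is tagged as sorted, and every delete file that
>>      * can affect it shares the same sort columns.
>>      */
>>     static boolean canUseSortedMergeScan(
>>             boolean sortEnforced, FileMeta dataFile, List<FileMeta> deleteFiles) {
>>         if (!sortEnforced || !dataFile.taggedSorted()) {
>>             return false;
>>         }
>>         for (FileMeta delete : deleteFiles) {
>>             if (!delete.taggedSorted()
>>                     || !Objects.equals(delete.sortColumnIds(), dataFile.sortColumnIds())) {
>>                 return false;  // fall back to the existing join-based apply
>>             }
>>         }
>>         return true;
>>     }
>> }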
>>
>> *Performance*
>> Compared to positional deletes, sorted equality deletes DO NOT
>> automatically prune the data files affected by the deletes (this by
>> design), therefore, we should generally expect positional deletes to
>> perform better. However, this effect can be mitigated by a linear factor
>> using a partition spec with a bucketing transformation. For example,
>> partitioning the data by the unique key into 16 buckets and producing 16
>> sorted equality delete files with associated partition values, reduces the
>> amount of unrelated deletes that have to be scanned by a factor of 16.
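>>
>> A small sketch of why bucketing prunes deletes: a reader of data in
>> bucket b only needs the equality-delete file written for bucket b.
>> Iceberg's actual bucket transform hashes with Murmur3; a plain
>> Long.hashCode stands in here for illustration.
>>
>> class BucketPruning {
>>     static int bucket(long key, int numBuckets) {
>>         return (Long.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
>>     }
>>
>>     public static void main(String[] args) {
>>         // Of 16 sorted equality-delete files, only one can contain a
>>         // delete for this key, so the other 15 are never scanned.
>>         System.out.println("read delete file for bucket " + bucket(42L, 16));
>>     }
>> }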
>>
>> *Compatibility*
>> The optimization is enabled by the sort-enforced configuration setting
>> and is turned off by default. Therefore, any existing jobs should
>> continue to work unaffected.
>>
>> Thanks for taking a look.
>>
>> Best,
>>
>> --
>> Edgar R
>>
>
