Hey Edgar,

Thanks for the well-articulated proposal.

I'm a little concerned that the proposed approach only partially addresses
the underlying challenge with equality deletes.  Equality deletes are
extremely powerful because you can delete a row anywhere in the dataset
without any read cost.  The delete applies instantly and globally.  A
positional delete takes the opposite approach, reading all applicable data
to identify which rows are affected.  These two approaches are at opposite
ends of the spectrum of when work is done.

The performance of applying equality deletes is greatly overshadowed by the
impact that delete-file amplification has on reading the data.  In the
degenerate case, you could have a client repeatedly deleting single rows of
data at a global scope.  The order internal to those small deletes doesn't
really matter; it's the compounding number of delete files that need to be
read in order to apply the deletes.  Each data file may have hundreds of
equality delete files that apply to it.  What makes this worse is that
compaction is difficult because we don't encode the sequencing of deletes in
a way that lets them be combined.

If scaling equality deletes is the goal, I feel like introducing a
shuffle-based join approach in Spark (as opposed to the current broadcast)
would significantly help with large numbers of delete files and memory
pressure, at the added cost of a shuffle.  Aggressive maintenance is the
only effective approach I've seen work, but that can be costly if there's a
high rate of churn in the data.
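
To make the idea concrete, the join swap would look roughly like this in
Spark's Java API.  This is illustrative only: the Dataset names and the join
key ("id") are placeholders, and Iceberg's actual delete application happens
inside its readers rather than as a user-level join.

    import static org.apache.spark.sql.functions.broadcast;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class DeleteJoinSketch {
      // Broadcast variant: every task receives a copy of all equality
      // deletes, fast for small delete sets but a source of memory pressure.
      static Dataset<Row> applyWithBroadcast(Dataset<Row> data, Dataset<Row> deletes) {
        return data.join(broadcast(deletes),
            data.col("id").equalTo(deletes.col("id")), "left_anti");
      }

      // Shuffle variant: both sides are hash-partitioned on the key, trading
      // a shuffle for bounded per-task memory.
      static Dataset<Row> applyWithShuffle(Dataset<Row> data, Dataset<Row> deletes) {
        return data.join(deletes.hint("shuffle_hash"),
            data.col("id").equalTo(deletes.col("id")), "left_anti");
      }
    }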

I would love to see an approach for deletes that would strike a better
balance between read and write cost (e.g. some form of reverse indexing) to
fill in the large tradeoff gap we have between equality and positional
deletes.

Best,
Dan


On Wed, Apr 2, 2025 at 11:00 PM Gang Wu <ust...@gmail.com> wrote:

> CMIW, the spec does not enforce `identifier fields` for equality
> delete files. Engines are free to use different `equality_ids` among
> commits, though the use case should be rare. Similarly, what sort order
> should we use? It is common for a table to set sort order on columns other
> than the primary keys. It would be more complicated if `equality_ids` vary
> among commits.
>
> Best,
> Gang
>
> On Wed, Apr 2, 2025 at 6:48 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Edgar,
>>
>> Thanks for the well-described proposal!
>>
>> Knowing the Flink connector, I have the following concerns:
>> - The Flink connector currently doesn't sort the rows in the data files.
>> It "chickens out" of this to avoid keeping anything in memory.
>> - Sorting the equality delete rows would also add memory pressure to the
>> writers.
>> - Bucketing is a good solution when the number of rows/changes is stable,
>> but changes in load could require rebucketing (repartitioning). This would
>> cause issues with the scope of the equality delete files.
>>
>> I have collected my concerns, but this doesn't mean that we should not
>> consider this proposal if the community would like to keep equality
>> delete files rather than deprecate them.
>>
>> Thanks, Peter
>>
>>
>>
>>
>> On Tue, Apr 1, 2025, 11:53 Edgar Rodriguez
>> <edgar.rodrig...@airbnb.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I know there have been some conversations regarding optimizing equality
>>> deletes and even possibly deprecating them. We have been thinking
>>> internally about a way to optimize merge-on-read with equality deletes to
>>> better balance read performance while keeping the benefits of performant
>>> writes by encoding deletes as equality deletes in streaming engines.
>>>
>>> The following are our high-level thoughts, which we'd like to share with
>>> the community to see whether we are thinking about this the right way in
>>> terms of the Iceberg spec, or whether there are obvious issues with the
>>> approach that we've missed. We'd appreciate the feedback, and if this
>>> sounds reasonable at a high level, we could follow up with a more
>>> detailed design doc to review.
>>>
>>> *Motivation*
>>> Iceberg currently supports two types of deletes for merge-on-read:
>>>
>>>    1. Positional deletes, stored as sorted tuples of (file name, position).
>>>    2. Equality deletes, stored in roughly the same schema as the data
>>>    files, where each row's column values form a conjunction of equality
>>>    predicates identifying the rows to delete.
>>>
>>> Since positional deletes are sorted the same way the data is (i.e. by
>>> file position), a simple sorted merge can be applied, which is very
>>> efficient for the reader. Equality deletes have to be joined in some way,
>>> either by shuffling the data or by broadcasting the equality deletes.
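>>>
>>> To make the contrast concrete, the positional case reduces to a single
>>> forward pass over two sorted sequences. A minimal sketch, with simplified
>>> types rather than Iceberg's actual reader classes:
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.List;
>>>
>>>     // Sketch only: apply sorted positional deletes to one data file via
>>>     // a two-pointer merge; "positions" holds this file's deleted row
>>>     // positions in ascending order.
>>>     class PositionalMergeSketch {
>>>       static <T> List<T> apply(List<T> rows, long[] positions) {
>>>         List<T> kept = new ArrayList<>();
>>>         int d = 0;                        // cursor into sorted positions
>>>         for (int pos = 0; pos < rows.size(); pos++) {
>>>           while (d < positions.length && positions[d] < pos) {
>>>             d++;                          // skip deletes before this row
>>>           }
>>>           if (d < positions.length && positions[d] == pos) {
>>>             continue;                     // this row position is deleted
>>>           }
>>>           kept.add(rows.get(pos));
>>>         }
>>>         return kept;
>>>       }
>>>     }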
>>>
>>> Writers rarely know the positions of the rows that need to be deleted;
>>> instead, the data to be deleted is identified at write time via some
>>> unique key (impacting write performance). As a result, to better balance
>>> read and write performance, the following two options are generally
>>> considered:
>>>
>>>    1. Encode deletions using equality deletes. Periodically run a
>>>    maintenance job that rewrites equality deletes into positional deletes
>>>    to maintain acceptable read-side performance.
>>>    2. Maintain an index of all the data in data files and write
>>>    positional deletes directly by converting the unique key into a
>>>    positional delete tuple.
>>>
>>>
>>> Both of these implementations have significant downsides:
>>>
>>>    - In the former case, read-side performance depends on the number of
>>>    modifications applied to the table and on how frequently the
>>>    maintenance job runs. A sudden change in table workload (e.g. a
>>>    backfill modifying a large number of rows) can affect read performance
>>>    significantly. Additionally, the maintenance job performs computation
>>>    proportional to the total number of rows in the table, which is larger
>>>    than the mutated set and tends to grow over time.
>>>    - In the latter case, the writer has to maintain the index, which for
>>>    large tables can be measured in TBs and can lead to long checkpoints
>>>    and restarts.
>>>
>>> One observation is that the significant speed-up on the read side comes
>>> from the fact that positional deletes and data files are sorted in the
>>> same way. This proposal describes sorted equality deletes with bucketing,
>>> which provide a middle ground between the two encodings that is both
>>> reader- and writer-friendly.
>>>
>>> *Requirements*
>>>
>>>    - All existing jobs and Iceberg tables should continue to work
>>>    unaffected unless the optimization is explicitly enabled.
>>>    - The optimization is only applicable to tables that specify a sort
>>>    order.
>>>
>>>
>>> *Proposal: Sorted Equality Deletes*
>>> The current Iceberg spec provides a way to encode sorting
>>> <https://iceberg.apache.org/spec/#sorting> applied to data files and/or
>>> delete files; however, that sort order is currently not enforced for
>>> writers. Violating the sort order may affect performance, but not
>>> correctness. The proposal here is the following:
>>>
>>>    1. Introduce a configuration option (e.g. sort-enforced) that allows
>>>    query planning to take into account the sorted nature of data files
>>>    and equality delete files. This means that if the new setting is set
>>>    but a compute engine does not enforce the sort, the write will fail.
>>>    There would also be a need to tag data files as "data-sorted" to
>>>    ensure that all data files and delete files involved in a scan are
>>>    sorted as well.
>>>    2. During data scans, the following optimization is added: if a data
>>>    file and all delete files that can affect it are sorted on the same
>>>    columns, a sorted merge scan is used instead of a join (a minimal
>>>    sketch of such a merge follows below).
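>>>
>>> A minimal sketch of the sorted merge scan in step 2, assuming both sides
>>> are sorted by the same key comparator; sequence-number checks (a delete
>>> only applies to data written earlier) and real Iceberg reader types are
>>> omitted for brevity:
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.Comparator;
>>>     import java.util.Iterator;
>>>     import java.util.List;
>>>     import java.util.function.Function;
>>>
>>>     // Sketch only: drop every data row whose sort key matches a key in
>>>     // the sorted stream of equality deletes.
>>>     class EqualityMergeSketch {
>>>       static <T, K> List<T> apply(Iterator<T> sortedRows,
>>>                                   Iterator<K> sortedDeleteKeys,
>>>                                   Function<T, K> keyOf,
>>>                                   Comparator<K> cmp) {
>>>         List<T> kept = new ArrayList<>();
>>>         K del = sortedDeleteKeys.hasNext() ? sortedDeleteKeys.next() : null;
>>>         while (sortedRows.hasNext()) {
>>>           T row = sortedRows.next();
>>>           K key = keyOf.apply(row);
>>>           while (del != null && cmp.compare(del, key) < 0) {
>>>             del = sortedDeleteKeys.hasNext() ? sortedDeleteKeys.next() : null;
>>>           }
>>>           if (del != null && cmp.compare(del, key) == 0) {
>>>             continue;                     // key is deleted; skip the row
>>>           }
>>>           kept.add(row);
>>>         }
>>>         return kept;
>>>       }
>>>     }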
>>>
>>> *Performance*
>>> Compared to positional deletes, sorted equality deletes DO NOT
>>> automatically prune the data files affected by the deletes (this is by
>>> design); therefore, we should generally expect positional deletes to
>>> perform better. However, this effect can be mitigated by a linear factor
>>> using a partition spec with a bucketing transformation. For example,
>>> partitioning the data by the unique key into 16 buckets and producing 16
>>> sorted equality delete files with associated partition values reduces the
>>> number of unrelated deletes that have to be scanned by a factor of 16.
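>>>
>>> For concreteness, such a layout could be declared through Iceberg's Java
>>> API roughly as follows; the column names below are placeholders:
>>>
>>>     import org.apache.iceberg.PartitionSpec;
>>>     import org.apache.iceberg.Schema;
>>>     import org.apache.iceberg.SortOrder;
>>>     import org.apache.iceberg.types.Types;
>>>
>>>     // Illustrative only: a table keyed by "user_id", bucketed into 16
>>>     // partitions and sorted on the same key so that data files and
>>>     // equality delete files can be merge-scanned bucket by bucket.
>>>     class BucketedLayoutSketch {
>>>       static final Schema SCHEMA = new Schema(
>>>           Types.NestedField.required(1, "user_id", Types.LongType.get()),
>>>           Types.NestedField.optional(2, "payload", Types.StringType.get()));
>>>
>>>       // Each equality delete file only has to be merged with the data
>>>       // files that share its bucket.
>>>       static final PartitionSpec SPEC =
>>>           PartitionSpec.builderFor(SCHEMA).bucket("user_id", 16).build();
>>>
>>>       static final SortOrder ORDER =
>>>           SortOrder.builderFor(SCHEMA).asc("user_id").build();
>>>     }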
>>>
>>> *Compatibility*
>>> The optimization is enabled by the sort-enforced configuration setting
>>> and is turned off by default. Therefore, any existing jobs should continue
>>> to work unaffected.
>>>
>>> Thanks for taking a look.
>>>
>>> Best,
>>>
>>> --
>>> Edgar R
>>>
>>
