Hi all,

I know there have been some conversations regarding optimizing equality
deletes, and even their possible deprecation. We have been thinking
internally about a way to optimize merge-on-read with equality deletes to
better balance read performance while keeping the benefit of performant
writes that streaming engines get from encoding deletes as equality deletes.

Below are our high-level thoughts, which we'd like to share with the
community to see whether we are thinking about this the right way in terms
of the Iceberg spec, or whether there are obvious issues with the approach
that we've missed. We'd appreciate your feedback, and if this sounds
reasonable at a high level, we could follow up with a more detailed design
doc to review.

*Motivation*
Iceberg currently supports two types of deletes for merge-on-read:

   1. Positional deletes, stored as sorted tuples: (file name, position).
   2. Equality deletes, stored in roughly the same schema as the data
   files, where a value in a column adds an equality predicate to the
   conjunction.

Since positional deletes are sorted the same way the data is (i.e. by file
position), a simple sorted merge can be applied, which is very efficient
for the reader. Equality deletes have to be joined in some way, either by
shuffling the data or by broadcasting the equality deletes.
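
To make that read-side difference concrete, here is a minimal, simplified
sketch (not the actual Iceberg reader code; Row handling and key extraction
are placeholders) of how the two encodings are typically applied to a single
data file: positional deletes via a forward-only sorted merge on position,
equality deletes via a join against the delete keys, modeled here as a hash
set lookup.

import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Simplified sketch of applying the two delete encodings to one data file.
class DeleteApplication {

  // Positional deletes: both sides are ordered by row position, so a single
  // forward pass (sorted merge) is enough.
  static void applyPositionalDeletes(List<Object> rows, Iterator<Long> deletedPositions) {
    long nextDelete = deletedPositions.hasNext() ? deletedPositions.next() : Long.MAX_VALUE;
    for (int pos = 0; pos < rows.size(); pos++) {
      while (nextDelete < pos) {
        nextDelete = deletedPositions.hasNext() ? deletedPositions.next() : Long.MAX_VALUE;
      }
      if (pos == nextDelete) {
        continue; // row is deleted
      }
      emit(rows.get(pos));
    }
  }

  // Equality deletes: with no shared order, the reader has to join, e.g. by
  // materializing the delete keys into a hash set (or shuffling/broadcasting).
  static void applyEqualityDeletes(List<Object> rows, Set<Object> deletedKeys) {
    for (Object row : rows) {
      if (!deletedKeys.contains(key(row))) {
        emit(row);
      }
    }
  }

  static Object key(Object row) { return row; }   // placeholder key extraction
  static void emit(Object row) { System.out.println(row); }
}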

Writers rarely know the positions of the rows that need to be deleted;
instead, the data to be deleted is identified at write time (impacting write
performance) via some unique key. As a result, to better balance read and
write performance for deletes, the following two options are generally
considered:

   1. Encode deletions using equality deletes. Periodically run a
   maintenance job that rewrites equality deletes into positional deletes to
   maintain acceptable read-side performance.
   2. Maintain an index of all the data in data files and write positional
   deletes directly by converting the unique key into a positional delete
   tuple.


Both of these implementations have significant downsides:

   - In the former case, read-side performance depends on the number of
   modifications applied to the table and the frequency of the maintenance
   job. A sudden change in table workload (e.g. a backfill modifying a large
   number of rows) can affect read performance significantly. Additionally,
   the maintenance job performs computation proportional to the total number
   of rows in the table, which is larger than the mutated set and tends to
   grow over time.
   - In the latter case, the writer has to maintain the index (sketched
   below), which for large tables can be measured in TBs and can lead to
   long checkpoints and restarts.
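
For illustration, a minimal sketch of the writer-side state that option 2
implies, assuming a hypothetical streaming writer that keeps a key-to-(file,
position) index; the class and method names are made up for this example.
The point is that this state grows with the table itself, which is what makes
checkpoints and restarts expensive.

import java.util.HashMap;
import java.util.Map;

// Hypothetical writer-side index from unique key to current row location.
class PositionIndexWriter {
  record RowLocation(String filePath, long position) {}

  private final Map<String, RowLocation> index = new HashMap<>();

  void onInsert(String key, String filePath, long position) {
    index.put(key, new RowLocation(filePath, position));
  }

  // Returns the positional delete tuple to write for this key, if the row exists.
  RowLocation onDelete(String key) {
    return index.remove(key);
  }
}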

One observation is that the significant speed-up on the read side comes from
the fact that positional deletes and data files are sorted in the same way.
This proposal describes sorted equality deletes with bucketing, which provide
a middle ground between the two encodings that is both reader- and
writer-friendly.

*Requirements*

   - All existing jobs and Iceberg tables should continue to work
   unaffected, unless the optimization is explicitly enabled.
   - The optimization is only applicable to tables that specify a sort order.


*Proposal: Sorted Equality Deletes*
The current Iceberg spec provides a way to encode the sorting
<https://iceberg.apache.org/spec/#sorting> applied to data files and/or
delete files; however, that sort order is currently not enforced on writers.
Violating the sort order may affect performance, but not correctness. The
proposal is the following:

   1. Introduce a configuration option (e.g. sort-enforced) that allows
   query planning to take into account the sorted nature of data files and
   equality delete files. This means that if the new setting is set but a
   compute engine does not enforce the sort, the write will fail. Data files
   would also need to be tagged as "data-sorted" to ensure that all data
   files and delete files involved in a scan are in fact sorted.
   2. During data scans the following optimization is added: if a data file
   and all delete files that can affect it are sorted on the same columns, a
   sorted merge scan is used instead of joins (see the sketch below).
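
As a rough illustration of point 2, a simplified sketch (not engine code; the
Row interface and iterator inputs are placeholders) of what such a merge scan
could look like when the data rows and the equality delete keys both arrive
sorted on the same key columns:

import java.util.Iterator;

// Single-pass merge of sorted data rows against sorted equality delete keys,
// replacing the hash/broadcast join that unsorted equality deletes require.
class SortedMergeScan {

  interface Row<K extends Comparable<K>> {
    K key();
  }

  static <K extends Comparable<K>> void scan(
      Iterator<? extends Row<K>> sortedRows, Iterator<K> sortedDeleteKeys) {
    K nextDelete = sortedDeleteKeys.hasNext() ? sortedDeleteKeys.next() : null;
    while (sortedRows.hasNext()) {
      Row<K> row = sortedRows.next();
      // Advance the delete side until it is at or past the current row's key.
      while (nextDelete != null && nextDelete.compareTo(row.key()) < 0) {
        nextDelete = sortedDeleteKeys.hasNext() ? sortedDeleteKeys.next() : null;
      }
      if (nextDelete != null && nextDelete.compareTo(row.key()) == 0) {
        continue; // row matches an equality delete key, skip it
      }
      System.out.println(row); // emit the surviving row
    }
  }
}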

*Performance*
Compared to positional deletes, sorted equality deletes DO NOT
automatically prune the data files affected by the deletes (this by
design), therefore, we should generally expect positional deletes to
perform better. However, this effect can be mitigated by a linear factor
using a partition spec with a bucketing transformation. For example,
partitioning the data by the unique key into 16 buckets and producing 16
sorted equality delete files with associated partition values, reduces the
amount of unrelated deletes that have to be scanned by a factor of 16.
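
Assuming a hypothetical unique key column "id", such a layout could be
declared with the existing PartitionSpec and SortOrder APIs (the schema and
column names below are made up for illustration):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SortOrder;
import org.apache.iceberg.types.Types;

// Bucket the unique key into 16 partitions and sort by it, so each sorted
// equality delete file only applies to 1/16th of the data.
class BucketedLayoutExample {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "payload", Types.StringType.get()));

    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .bucket("id", 16)
        .build();

    SortOrder sortOrder = SortOrder.builderFor(schema)
        .asc("id")
        .build();

    System.out.println(spec + "\n" + sortOrder);
  }
}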

*Compatibility*
The optimization is enabled by the sort-enforced configuration setting and
is turned-off by default. Therefore, any existing jobs should continue to
work unaffected.
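
Purely to illustrate the opt-in behavior, the setting could for example be
exposed as a table property set through the existing Table API; the property
name "write.sort-enforced" below is hypothetical and not part of the current
spec, and tables that never set it keep today's behavior.

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical opt-in: nothing changes unless the property is explicitly set.
class EnableSortEnforced {
  static void enable(Catalog catalog, TableIdentifier id) {
    Table table = catalog.loadTable(id);
    table.updateProperties()
        .set("write.sort-enforced", "true")
        .commit();
  }
}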

Thanks for taking a look.

Best,

-- 
Edgar R
