Fundamentally, it is very difficult to write position deletes with concurrent writers, and conflicts affect batch jobs too, as the inverted index may become invalid/stale.
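To make the staleness problem concrete, here is a minimal toy sketch (not Iceberg code; all file names and the index shape are made up for illustration) of why an engine-side inverted index of key → (data file, row position) goes stale the moment a concurrent writer rewrites files:

```python
# Toy model (not Iceberg code): an engine keeps an inverted index of
# key -> (data_file, row_position) so it can write position deletes cheaply.
index = {
    "k1": ("data-001.parquet", 0),
    "k2": ("data-001.parquet", 1),
    "k3": ("data-002.parquet", 0),
}

# A concurrent compaction commits first and rewrites data-001.parquet.
live_files = {"data-002.parquet", "data-003.parquet"}  # data-001 compacted away

# Every index entry pointing at the rewritten file is now stale: a position
# delete derived from it would reference a file no longer in the snapshot.
stale = {k: v for k, v in index.items() if v[0] not in live_files}
print(stale)  # {'k1': ('data-001.parquet', 0), 'k2': ('data-001.parquet', 1)}
```

Any position delete emitted from a stale entry either conflicts at commit time or silently deletes nothing, which is the invalidation risk described above.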
The position deletes are created during the write phase, but conflicts are only detected at the commit stage. I assume the batch job should fail in this case.

On Fri, Nov 1, 2024 at 10:57 AM Steven Wu <stevenz...@gmail.com> wrote:

> Shani,
>
> That is a good point. It is certainly a limitation for the Flink job to track the inverted index internally (which is what I had in mind). It can't be shared/synchronized with other Flink jobs or other engines writing to the same table.
>
> Thanks,
> Steven
>
> On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar <sh...@upsolver.com.invalid> wrote:
>
>> Even if Flink can create this state, it would have to be maintained against the Iceberg table; we wouldn't like duplicates (keys) if other systems / users update the table (e.g. manual inserts / updates using DML).
>>
>> Shani.
>>
>> On 1 Nov 2024, at 18:32, Steven Wu <stevenz...@gmail.com> wrote:
>>
>> > Add support for inverted indexes to reduce the cost of position lookup. This is fairly tricky to implement for streaming use cases without an external system.
>>
>> Anton, that is also what I was saying earlier. In Flink, the inverted index of (key, committed data files) can be tracked in Flink state.
>>
>> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>
>>> I was a bit skeptical when we were adding equality deletes, but nothing beats their performance during writes. We have to find an alternative before deprecating them.
>>>
>>> We are doing a lot of work to improve streaming, like reducing the cost of commits, enabling a large (potentially infinite) number of snapshots, changelog reads, and so on. It is a project goal to excel in streaming.
>>>
>>> I was going to focus on equality deletes after completing the DV work. I believe we have these options:
>>>
>>> - Revisit the existing design of equality deletes (e.g. add more restrictions, improve compaction, offer new writers).
>>> - Standardize on the view-based approach [1] to handle streaming upserts and CDC use cases, potentially making this part of the spec.
>>> - Add support for inverted indexes to reduce the cost of position lookup. This is fairly tricky to implement for streaming use cases without an external system. Our runtime filtering in Spark today is equivalent to looking up positions in an inverted index represented by another Iceberg table. That may still not be enough for some streaming use cases.
>>>
>>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>>>
>>> - Anton
>>>
>>> On Thu, Oct 31, 2024 at 9:31 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> I agree that equality deletes have their place in streaming. I think the ultimate decision here is how opinionated Iceberg wants to be about its use cases. If it really wants to stick to its origins of "slow-moving data", then removing equality deletes would be in line with that. The other high-level question is how much we allow for partially compatible features (the row-lineage feature was explicitly approved excluding equality deletes, and people seemed OK with that at the time; if all features need to work together, then maybe we need to rethink the design here so it can be forward compatible with equality deletes).
>>>>
>>>> I think one issue with equality deletes as stated in the spec is that they are overly broad. I'd be interested if people have use cases that differ, but one way of narrowing the specification scope on equality deletes (and probably a necessary building block for building something better) is to focus on upsert/streaming deletes. Two proposals in this regard are:
>>>>
>>>> 1. Require that equality deletes can only correspond to unique identifiers for the table.
>>>> 2.
>>>> Consider requiring that, for equality deletes on partitioned tables, the primary key must contain a partition column (I believe Flink at least already does this). It is less clear to me that this would meet all existing use cases, but having it would allow for better incremental data structures, which could then be partition-based.
>>>>
>>>> Narrowing the scope to unique identifiers would allow for further building blocks already mentioned, like a secondary index (possible via an LSM tree), that would allow for better performance overall.
>>>>
>>>> I generally agree with the sentiment that we shouldn't deprecate them until there is a viable replacement. With all due respect to my employer, let's not fall into the Google trap [1] :)
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>> [1] https://goomics.net/50/
>>>>
>>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo <alex...@starburstdata.com> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> Just to throw my 2 cents in, I agree with Steven and others that we do need some kind of replacement before deprecating equality deletes. They certainly have their problems, and do significantly increase complexity as they are now, but writing position deletes is too expensive for certain pipelines.
>>>>>
>>>>> We've been investigating using equality deletes for some of our workloads at Starburst; the key advantage we were hoping to leverage is cheap, effectively random-access deletes. Say you have a UUID column that's unique in a table and want to delete a row by UUID. With position deletes, each delete is expensive without an index on that UUID. With equality deletes, each delete is cheap while reads/compaction are expensive, but when updates are frequent and reads are sporadic that's a reasonable tradeoff.
>>>>>
>>>>> Pretty much what Jason and Steven have already said.
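The UUID example above can be sketched with a toy comparison (hypothetical structures, not Iceberg code) showing why the writer's cost differs so sharply between the two delete types:

```python
# Toy comparison (not Iceberg code) of the write-side cost of the two delete
# kinds. The table layout and record shapes here are made up for illustration.
table = {  # data_file -> list of (uuid, payload) rows
    "f1": [("a1", "x"), ("b2", "y")],
    "f2": [("c3", "z"), ("d4", "w")],
}

def equality_delete(uuid):
    # O(1) for the writer: just record the predicate; readers apply it later.
    return {"type": "equality", "field": "uuid", "value": uuid}

def position_delete(uuid):
    # Without an index, the writer must scan data files to find the position.
    for f, rows in table.items():
        for pos, (u, _) in enumerate(rows):
            if u == uuid:
                return {"type": "position", "file": f, "pos": pos}
    return None

print(equality_delete("c3"))  # constant work, no scan
print(position_delete("c3"))  # {'type': 'position', 'file': 'f2', 'pos': 0}
```

The scan inside `position_delete` is the "expensive without an index" part; the equality delete merely defers that scan to every subsequent reader.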
>>>>>
>>>>> Maybe there are some incremental improvements on equality deletes, or tips from similar systems, that might alleviate some of their problems?
>>>>>
>>>>> - Alex Jo
>>>>>
>>>>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> We probably all agree on the downside of equality deletes: they postpone all the work to the read path.
>>>>>>
>>>>>> In theory, we could implement position deletes only in the Flink streaming writer. It would require tracking the last committed data files per key, which could be stored in Flink state (checkpointed). This is obviously quite expensive/challenging, but possible.
>>>>>>
>>>>>> I'd like to echo one benefit of equality deletes that Russell called out in the original email: equality deletes never have conflicts. That is important for streaming writers (Flink, Kafka Connect, ...) that commit frequently (minutes or less). Assume Flink could write position deletes only and commit every 2 minutes. The long-running nature of streaming jobs can cause frequent commit conflicts with background delete compaction jobs.
>>>>>>
>>>>>> Overall, streaming upsert writes are not a well-solved problem in Iceberg. This probably affects all streaming engines (Flink, Kafka Connect, Spark streaming, ...). We need to come up with better alternatives before we can deprecate equality deletes.
>>>>>>
>>>>>> On Thu, Oct 31, 2024 at 8:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> For users of equality deletes, what are the key benefits of equality deletes that you would like to preserve? Could you please share some concrete examples of the queries you want to run (and the schemas and data sizes you would like to run them against) and the latencies that would be acceptable?
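Steven's conflict point above can be illustrated with a toy validation check (this is a simplification, not Iceberg's actual commit-validation logic): a position delete conflicts when a concurrent compaction rewrote the file it points at, while an equality delete names values rather than files and so still applies:

```python
# Toy sketch (not Iceberg's real commit logic; names are made up).
snapshot_files = {"f1", "f2"}

def after_compaction(files):
    # A background compaction commits first, rewriting f1 into f3.
    return (files - {"f1"}) | {"f3"}

def validate(delete, live_files):
    if delete["type"] == "position":
        # A position delete referencing a vanished file is a conflict.
        return delete["file"] in live_files
    # Equality deletes reference values, not files: nothing to conflict with.
    return True

live = after_compaction(snapshot_files)
print(validate({"type": "position", "file": "f1", "pos": 7}, live))  # False
print(validate({"type": "equality", "uuid": "a1"}, live))            # True
```

A streaming job committing position deletes every couple of minutes would hit the `False` branch whenever compaction wins the race, forcing a retry or failure; the equality-delete writer never does.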
>>>>>>>
>>>>>>> On Thu, Oct 31, 2024 at 10:05 AM Jason Fine <ja...@upsolver.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Representing Upsolver here, we also make use of equality deletes to deliver high-frequency, low-latency updates to our clients at scale. We have customers using them at scale, demonstrating the need and viability. We automate the process of converting them into positional deletes (or fully applying them) in the background for more efficient engine queries, giving our users both low latency and good query performance.
>>>>>>>>
>>>>>>>> Equality deletes were added because there isn't a good way to solve frequent updates otherwise. It would require some sort of index keeping track of every record in the table (by a predetermined PK), and maintaining such an index is a huge task that every tool interested in this would need to re-implement. It also becomes a bottleneck limiting table sizes.
>>>>>>>>
>>>>>>>> I don't think they should be removed without providing an alternative. Positional deletes inherently have a different performance profile, requiring more upfront work proportional to the table size.
>>>>>>>>
>>>>>>>> On Thu, Oct 31, 2024 at 2:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>
>>>>>>>>> Hi Russell,
>>>>>>>>>
>>>>>>>>> Thanks for the nice writeup and the proposal.
>>>>>>>>>
>>>>>>>>> I agree with your analysis, and I have the same feeling. However, I think there are more engines than Flink that write equality delete files. So, I agree to deprecate in V3, but maybe we should be more "flexible" about removal in V4 in order to give engines time to update. I think that by deprecating equality deletes, we are clearly focusing on read performance and "consistency" (more than write).
>>>>>>>>> It's not necessarily a bad thing, but streaming and data ingestion platforms will probably be concerned about it (by using positional deletes, they would have to scan/read all data files to find the position, which is painful).
>>>>>>>>>
>>>>>>>>> So, to summarize:
>>>>>>>>> 1. Agree to deprecate equality deletes, but -1 on committing to any target for removal before having a clear path for streaming platforms (Flink, Beam, ...).
>>>>>>>>> 2. In the meantime (during the deprecation period), I propose to explore possible improvements for streaming platforms (maybe finding a way to avoid full data file scans, ...).
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On Wed, Oct 30, 2024 at 10:06 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Background:
>>>>>>>>> >
>>>>>>>>> > 1) Position Deletes
>>>>>>>>> >
>>>>>>>>> > Writers determine which rows are deleted and mark them in a 1-for-1 representation. With delete vectors, this means every data file has at most one delete vector that is read in conjunction with it to excise deleted rows. Reader overhead is more or less constant and very predictable.
>>>>>>>>> >
>>>>>>>>> > The main cost of this mode is that deletes must be determined at write time, which is expensive and can be more difficult for conflict resolution.
>>>>>>>>> >
>>>>>>>>> > 2) Equality Deletes
>>>>>>>>> >
>>>>>>>>> > Writers write out references to which values are deleted (in a partition or globally). There can be an unlimited number of equality deletes, and they all must be checked for every data file that is read. The cost of determining deleted rows is essentially given to the reader.
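The delete-vector model described under "1) Position Deletes" can be sketched in a few lines (a toy model, not Iceberg's actual bitmap format): one vector of deleted positions per data file, and the reader simply skips those positions, so its overhead is bounded by that single vector:

```python
# Toy sketch (not Iceberg's real delete-vector encoding, which uses bitmaps):
# one data file's rows plus its single set of deleted row positions.
rows = ["r0", "r1", "r2", "r3", "r4"]  # rows of one data file
delete_vector = {1, 3}                 # positions marked deleted for this file

# The reader excises deleted rows in a single pass over one structure.
live_rows = [r for pos, r in enumerate(rows) if pos not in delete_vector]
print(live_rows)  # ['r0', 'r2', 'r4']
```

Because each data file pairs with at most one such vector, the read-side cost stays proportional to the file itself, which is the predictability Russell highlights.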
>>>>>>>>> >
>>>>>>>>> > Conflicts almost never happen since data files are not actually changed, and there is almost no cost to the writer to generate these. Almost all costs related to equality deletes are passed on to the reader.
>>>>>>>>> >
>>>>>>>>> > Proposal:
>>>>>>>>> >
>>>>>>>>> > Equality deletes are, in my opinion, unsustainable, and we should work on deprecating and removing them from the specification. At this time, I know of only one engine (Apache Flink) that produces these deletes, but almost all engines have implementations to read them. The cost of implementing equality deletes on the read path is difficult and unpredictable in terms of memory usage and compute complexity. We've had suggestions of implementing RocksDB in order to handle ever-growing sets of equality deletes, which in my opinion shows that we are going down the wrong path.
>>>>>>>>> >
>>>>>>>>> > Outside of performance, equality deletes are also difficult to use in conjunction with many other features. For example, any features requiring CDC or row lineage are basically impossible when equality deletes are in use. When equality deletes are present, the state of the table can only be determined with a full scan, making it difficult to update differential structures. This means materialized views or indexes need to be essentially fully rebuilt whenever an equality delete is added to the table.
>>>>>>>>> >
>>>>>>>>> > Equality deletes essentially remove complexity from the write side but then add what I believe is an unacceptable level of complexity to the read side.
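The read-side cost Russell describes can be sketched as follows (a toy model, not Iceberg code; real readers also scope deletes by partition and sequence number, which this ignores): every row of every data file is checked against every applicable equality-delete predicate, so work grows with both the data and the accumulated deletes:

```python
# Toy sketch (not Iceberg code) of the equality-delete read path.
data_files = {
    "f1": [{"id": 1}, {"id": 2}],
    "f2": [{"id": 3}, {"id": 4}],
}
equality_deletes = [{"id": 2}, {"id": 4}]  # this list can grow without bound

def read(files, deletes):
    out = []
    for rows in files.values():
        for row in rows:
            # Every delete predicate is evaluated against every row read:
            # cost is O(rows * deletes), paid on each query until compaction.
            deleted = any(
                all(row.get(k) == v for k, v in d.items()) for d in deletes
            )
            if not deleted:
                out.append(row)
    return out

print(read(data_files, equality_deletes))  # [{'id': 1}, {'id': 3}]
```

The `rows * deletes` product in the inner loop is exactly the unpredictable read cost the proposal objects to, and why RocksDB-style structures were suggested to keep the delete set manageable.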
>>>>>>>>> >
>>>>>>>>> > Because of this, I suggest we deprecate equality deletes in V3 and slate them for full removal from the Iceberg spec in V4.
>>>>>>>>> >
>>>>>>>>> > I know this is a big change and a compatibility breakage, so I would like to introduce this idea to the community and solicit feedback from all stakeholders. I am very flexible on this issue and would like to hear the best arguments both for and against removal of equality deletes.
>>>>>>>>> >
>>>>>>>>> > Thanks everyone for your time,
>>>>>>>>> >
>>>>>>>>> > Russ Spitzer
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> *Jason Fine*
>>>>>>>> Chief Software Architect
>>>>>>>> ja...@upsolver.com | www.upsolver.com