I'm strongly in favor of moving to the Delta + Base table approach discussed in the cookbook above. I wonder if we should codify that into something more standardized, but it seems to me to be a much better path forward. I'm not sure we need to support this at the spec level, but it would be nice if we could provide a table that was automatically broken into sub-tables and had well-defined operations on it.
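As a rough illustration of what such a delta + base table could look like, here is a minimal, purely hypothetical Python sketch of the semantics (the class name, its methods, and the in-memory dicts are all invented for illustration; real Iceberg tables, MERGE, and position deletes are not modeled):

```python
class DeltaBaseTable:
    """Hypothetical sketch of the Delta + Base pattern: all writes land in a
    small Delta table; scans merge Delta over Base by primary key."""

    def __init__(self, primary_key, max_delta_size):
        self.primary_key = primary_key        # tuple of PK column names
        self.max_delta_size = max_delta_size  # compaction threshold
        self.base = {}    # "Base" table: PK -> row
        self.delta = {}   # "Delta" table: PK -> row (None marks a delete)

    def _key(self, row):
        return tuple(row[c] for c in self.primary_key)

    def upsert(self, row):
        # All writes go against the delta table, MERGE-style by PK.
        self.delta[self._key(row)] = row
        if len(self.delta) > self.max_delta_size:
            self._compact()

    def delete(self, key):
        # A tombstone in delta, rather than an equality delete.
        self.delta[key] = None

    def scan(self):
        # The scan is a view joining delta and base on primary key:
        # when delta has a record for a PK, the base record is discarded.
        for key, row in self.base.items():
            if key not in self.delta:
                yield row
        for row in self.delta.values():
            if row is not None:
                yield row

    def _compact(self):
        # On delta size exceeding the max: upsert DELTA into BASE,
        # then clear the upserted records from delta.
        for key, row in self.delta.items():
            if row is None:
                self.base.pop(key, None)
            else:
                self.base[key] = row
        self.delta.clear()
```

With the delta kept small, a reader pays one hash join per scan instead of checking an unbounded set of equality predicates against every data file.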
For example:

FastUpdateTable:
  Requires:
    - Primary Key Columns
    - Long: Max Delta Size
  Contains:
    - Private Iceberg Table: Delta
    - Private Iceberg Table: Base
  On All Scans:
    - Return a view which joins Delta and Base on primary key; if Delta has a record for a given primary key, discard the Base record.
  On All Writes:
    - Perform all writes against the Delta table; only MERGE is allowed.
    - Append is forbidden (no PK guarantees).
    - Only position deletes are allowed.
  On Delta Table Size exceeding Max Delta Size:
    - Upsert DELTA into BASE.
    - Clear upserted records from Delta.

If the Delta table size is kept small, I think this would be almost as performant as equality deletes but still be compatible with row-lineage and other indexing features.

On Tue, Nov 19, 2024 at 7:12 AM Manu Zhang <owenzhang1...@gmail.com> wrote: > Hi Ajantha, > > I'm proposing exploring a view-based approach similar to the > changelog-mirror table pattern[1] rather than supporting delta writers for > Kafka connect Iceberg sink. > > 1. > https://www.tabular.io/apache-iceberg-cookbook/data-engineering-cdc-table-mirroring/ > > On Tue, Nov 19, 2024 at 7:38 PM Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > >> I don’t think it’s a problem while an alternative is explored (the JDK >> itself does that pretty often). >> So it’s up to the community: of course I’m against removing it without >> solid alternative, but deprecation is fine imho. >> >> Regards >> JB >> >> On Tue, Nov 19, 2024 at 12:19 PM, Ajantha Bhat <ajanthab...@gmail.com> >> wrote: >> >>> - ok for deprecate equality deletes >>>> - not ok to remove it >>> >>> >>> @JB: I don't think it is a good idea to use deprecated functionality in >>> the new feature development. >>> Hence, my specific question was about kafka connect upsert operation.
>>> >>> @Manu: I meant the delta writers for kafka connect Iceberg sink (which >>> in turn are used for upserting the CDC records) >>> https://github.com/apache/iceberg/issues/10842 >>> >>> >>> - Ajantha >>> >>> >>> >>> On Tue, Nov 19, 2024 at 3:08 PM Manu Zhang <owenzhang1...@gmail.com> >>> wrote: >>> >>>> I second Anton's proposal to standardize on a view-based approach to >>>> handle CDC cases. >>>> Actually, it's already been explored in detail[1] by Jack before. >>>> >>>> [1] Improving Change Data Capture Use Case for Apache Iceberg >>>> <https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0#heading=h.94xnx4qg3bnt> >>>> >>>> >>>> On Tue, Nov 19, 2024 at 4:16 PM Jean-Baptiste Onofré <j...@nanthrax.net> >>>> wrote: >>>> >>>>> My proposal is the following (already expressed): >>>>> - ok for deprecate equality deletes >>>>> - not ok to remove it >>>>> - work on position deletes improvements to address streaming use >>>>> cases. I think we should explore different approaches. Personally I think >>>>> a >>>>> possible approach would be to find index way to data files to avoid full >>>>> scan to find row position. >>>>> >>>>> My $0.01 :) >>>>> >>>>> Regards >>>>> JB >>>>> >>>>> On Tue, Nov 19, 2024 at 7:53 AM, Ajantha Bhat <ajanthab...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi, What's the conclusion on this thread? >>>>>> >>>>>> Users are looking for Upsert (CDC) support for OSS Iceberg >>>>>> kafka connect sink. >>>>>> We only support appends at the moment. Can we go ahead and implement >>>>>> the upserts using equality deletes? >>>>>> >>>>>> >>>>>> - Ajantha >>>>>> >>>>>> On Sun, Nov 10, 2024 at 11:56 AM Vignesh <vignesh.v...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> I am reading about iceberg and am quite new to this. >>>>>>> This puffin would be an index from key to data file. Other use cases >>>>>>> of Puffin, such as statistics are at a per file level if I understand >>>>>>> correctly.
>>>>>>> >>>>>>> Where would the puffin about key->data file be stored? It is a >>>>>>> property of the entire table. >>>>>>> >>>>>>> Thanks, >>>>>>> Vignesh. >>>>>>> >>>>>>> >>>>>>> On Sat, Nov 9, 2024 at 2:17 AM Shani Elharrar >>>>>>> <sh...@upsolver.com.invalid> wrote: >>>>>>> >>>>>>>> JB, this is what we do, we write Equality Deletes and periodically >>>>>>>> convert them to Positional Deletes. >>>>>>>> >>>>>>>> We could probably index the keys, maybe partially index using bloom >>>>>>>> filters, the best would be to put those bloom filters inside puffin. >>>>>>>> >>>>>>>> Shani. >>>>>>>> >>>>>>>> On 9 Nov 2024, at 11:11, Jean-Baptiste Onofré <j...@nanthrax.net> >>>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I agree with Peter here, and I would say that it would be an issue >>>>>>>> for multi-engine support. >>>>>>>> >>>>>>>> I think, as I already mentioned with others, we should explore an >>>>>>>> alternative. >>>>>>>> As the main issue is the datafile scan in streaming context, maybe >>>>>>>> we could find a way to "index"/correlate for positional deletes with >>>>>>>> limited scanning. >>>>>>>> I will think again about that :) >>>>>>>> >>>>>>>> Regards >>>>>>>> JB >>>>>>>> >>>>>>>> On Sat, Nov 9, 2024 at 6:48 AM Péter Váry < >>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Imran, >>>>>>>>> >>>>>>>>> I don't think it's a good idea to start creating multiple types of >>>>>>>>> Iceberg tables. Iceberg's main selling point is compatibility between >>>>>>>>> engines. If we don't have readers and writers for all types of >>>>>>>>> tables, then >>>>>>>>> we remove compatibility from the equation and engine specific formats >>>>>>>>> always win. OTOH, if we write readers and writers for all types of >>>>>>>>> tables >>>>>>>>> then we are back on square one. >>>>>>>>> >>>>>>>>> Identifier fields are a table schema concept and used in many >>>>>>>>> cases during query planning and execution. 
This is why they are >>>>>>>>> defined as >>>>>>>>> part of the SQL spec, and this is why Iceberg defines them as well. >>>>>>>>> One use >>>>>>>>> case is where they can be used to merge deletes (independently of how >>>>>>>>> they >>>>>>>>> are manifested) and subsequent inserts, into updates. >>>>>>>>> >>>>>>>>> Flink SQL doesn't allow creating tables with partition transforms, >>>>>>>>> so no new table could be created by Flink SQL using transforms, but >>>>>>>>> tables >>>>>>>>> created by other engines could still be used (both read an write). >>>>>>>>> Also you >>>>>>>>> can create such tables in Flink using the Java API. >>>>>>>>> >>>>>>>>> Requiring partition columns be part of the identifier fields is >>>>>>>>> coming from the practical consideration, that you want to limit the >>>>>>>>> scope >>>>>>>>> of the equality deletes as much as possible. Otherwise all of the >>>>>>>>> equality >>>>>>>>> deletes should be table global, and they should be read by every >>>>>>>>> reader. We >>>>>>>>> could write those, we just decided that we don't want to allow the >>>>>>>>> user to >>>>>>>>> do this, as it is most cases a bad idea. >>>>>>>>> >>>>>>>>> I hope this helps, >>>>>>>>> Peter >>>>>>>>> >>>>>>>>> On Fri, Nov 8, 2024, 22:01 Imran Rashid >>>>>>>>> <iras...@cloudera.com.invalid> wrote: >>>>>>>>> >>>>>>>>>> I'm not down in the weeds at all myself on implementation >>>>>>>>>> details, so forgive me if I'm wrong about the details here. >>>>>>>>>> >>>>>>>>>> I can see all the viewpoints -- both that equality deletes enable >>>>>>>>>> some use cases, but also make others far more difficult. What >>>>>>>>>> surprised me >>>>>>>>>> the most is that Iceberg does not provide a way to distinguish these >>>>>>>>>> two >>>>>>>>>> table "types". >>>>>>>>>> >>>>>>>>>> At first, I thought the presence of an identifier-field ( >>>>>>>>>> https://iceberg.apache.org/spec/#identifier-field-ids) indicated >>>>>>>>>> that the table was a target for equality deletes. 
But, then it >>>>>>>>>> turns out >>>>>>>>>> identifier-fields are also useful for changelog views even without >>>>>>>>>> equality >>>>>>>>>> deletes -- IIUC, they show that a delete + insert should actually be >>>>>>>>>> interpreted as an update in changelog view. >>>>>>>>>> >>>>>>>>>> To be perfectly honest, I'm confused about all of these details >>>>>>>>>> -- from my read, the spec does not indicate this relationship between >>>>>>>>>> identifier-fields and equality_ids in equality delete files ( >>>>>>>>>> https://iceberg.apache.org/spec/#equality-delete-files), but I >>>>>>>>>> think that is the way Flink works. Flink itself seems to have even >>>>>>>>>> more >>>>>>>>>> limitations -- no partition transforms are allowed, and all partition >>>>>>>>>> columns must be a subset of the identifier fields. Is that just a >>>>>>>>>> Flink >>>>>>>>>> limitation, or is that the intended behavior in the spec? (Or maybe >>>>>>>>>> user-error on my part?) Those seem like very reasonable >>>>>>>>>> limitations, from >>>>>>>>>> an implementation point-of-view. But OTOH, as a user, this seems to >>>>>>>>>> be >>>>>>>>>> directly contrary to some of the promises of Iceberg. >>>>>>>>>> >>>>>>>>>> Its easy to see if a table already has equality deletes in it, by >>>>>>>>>> looking at the metadata. But is there any way to indicate that a >>>>>>>>>> table (or >>>>>>>>>> branch of a table) _must not_ have equality deletes added to it? >>>>>>>>>> >>>>>>>>>> If that were possible, it seems like we could support both use >>>>>>>>>> cases. We could continue to optimize for the streaming ingestion >>>>>>>>>> use cases >>>>>>>>>> using equality deletes. But we could also build more optimizations >>>>>>>>>> into >>>>>>>>>> the "non-streaming-ingestion" branches. And we could document the >>>>>>>>>> tradeoff >>>>>>>>>> so it is much clearer to end users. 
>>>>>>>>>> >>>>>>>>>> To maintain compatibility, I suppose that the change would be >>>>>>>>>> that equality deletes continue to be allowed by default, but we'd >>>>>>>>>> add a new >>>>>>>>>> field to indicate that for some tables (or branches of a table), >>>>>>>>>> equality >>>>>>>>>> deletes would not be allowed. And it would be an error for an >>>>>>>>>> engine to >>>>>>>>>> make an update which added an equality delete to such a table. >>>>>>>>>> >>>>>>>>>> Maybe that change would even be possible in V3. >>>>>>>>>> >>>>>>>>>> And if all the performance improvements to equality deletes make >>>>>>>>>> this a moot point, we could drop the field in v4. But it seems like >>>>>>>>>> a >>>>>>>>>> mistake to both limit the non-streaming use-case AND have confusing >>>>>>>>>> limitations for the end-user in the meantime. >>>>>>>>>> >>>>>>>>>> I would happily be corrected about my understanding of all of the >>>>>>>>>> above. >>>>>>>>>> >>>>>>>>>> thanks! >>>>>>>>>> Imran >>>>>>>>>> >>>>>>>>>> On Tue, Nov 5, 2024 at 9:16 AM Bryan Keller <brya...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I also feel we should keep equality deletes until we have an >>>>>>>>>>> alternative solution for streaming updates/deletes. >>>>>>>>>>> >>>>>>>>>>> -Bryan >>>>>>>>>>> >>>>>>>>>>> On Nov 4, 2024, at 8:33 AM, Péter Váry < >>>>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>> Well, it seems like I'm a little late, so most of the arguments >>>>>>>>>>> are voiced. >>>>>>>>>>> >>>>>>>>>>> I agree that we should not deprecate the equality deletes until >>>>>>>>>>> we have a replacement feature. >>>>>>>>>>> I think one of the big advantages of Iceberg is that it supports >>>>>>>>>>> batch processing and streaming ingestion too. >>>>>>>>>>> For streaming ingestion we need a way to update existing data in >>>>>>>>>>> a performant way, but restricting deletes for the primary keys >>>>>>>>>>> seems like >>>>>>>>>>> enough from the streaming perspective. 
>>>>>>>>>>> >>>>>>>>>>> Equality deletes allow a very wide range of applications, which >>>>>>>>>>> we might be able to narrow down a bit, but still keep useful. So if >>>>>>>>>>> we want >>>>>>>>>>> to go down this road, we need to start collecting the requirements. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Peter >>>>>>>>>>> >>>>>>>>>>> Shani Elharrar <sh...@upsolver.com.invalid> ezt írta (időpont: >>>>>>>>>>> 2024. nov. 1., P, 19:22): >>>>>>>>>>> >>>>>>>>>>>> I understand how it makes sense for batch jobs, but it damages >>>>>>>>>>>> stream jobs, using equality deletes works much better for >>>>>>>>>>>> streaming (which >>>>>>>>>>>> have a strict SLA for delays), and in order to decrease the >>>>>>>>>>>> performance >>>>>>>>>>>> penalty - systems can rewrite the equality deletes to positional >>>>>>>>>>>> deletes. >>>>>>>>>>>> >>>>>>>>>>>> Shani. >>>>>>>>>>>> >>>>>>>>>>>> On 1 Nov 2024, at 20:06, Steven Wu <stevenz...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Fundamentally, it is very difficult to write position deletes >>>>>>>>>>>> with concurrent writers and conflicts for batch jobs too, as the >>>>>>>>>>>> inverted >>>>>>>>>>>> index may become invalid/stale. >>>>>>>>>>>> >>>>>>>>>>>> The position deletes are created during the write phase. But >>>>>>>>>>>> conflicts are only detected at the commit stage. I assume the >>>>>>>>>>>> batch job >>>>>>>>>>>> should fail in this case. >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Nov 1, 2024 at 10:57 AM Steven Wu <stevenz...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Shani, >>>>>>>>>>>>> >>>>>>>>>>>>> That is a good point. It is certainly a limitation for the >>>>>>>>>>>>> Flink job to track the inverted index internally (which is what I >>>>>>>>>>>>> had in >>>>>>>>>>>>> mind). It can't be shared/synchronized with other Flink jobs or >>>>>>>>>>>>> other >>>>>>>>>>>>> engines writing to the same table. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Steven >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar >>>>>>>>>>>>> <sh...@upsolver.com.invalid> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Even if Flink can create this state, it would have to be >>>>>>>>>>>>>> maintained against the Iceberg table, we wouldn't like >>>>>>>>>>>>>> duplicates (keys) if >>>>>>>>>>>>>> other systems / users update the table (e.g manual insert / >>>>>>>>>>>>>> updates using >>>>>>>>>>>>>> DML). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Shani. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 1 Nov 2024, at 18:32, Steven Wu <stevenz...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> > Add support for inverted indexes to reduce the cost of >>>>>>>>>>>>>> position lookup. This is fairly tricky to implement for >>>>>>>>>>>>>> streaming use cases >>>>>>>>>>>>>> without an external system. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Anton, that is also what I was saying earlier. In Flink, the >>>>>>>>>>>>>> inverted index of (key, committed data files) can be tracked in >>>>>>>>>>>>>> Flink state. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi < >>>>>>>>>>>>>> aokolnyc...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I was a bit skeptical when we were adding equality deletes, >>>>>>>>>>>>>>> but nothing beats their performance during writes. We have to >>>>>>>>>>>>>>> find an >>>>>>>>>>>>>>> alternative before deprecating. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We are doing a lot of work to improve streaming, like >>>>>>>>>>>>>>> reducing the cost of commits, enabling a large (potentially >>>>>>>>>>>>>>> infinite) >>>>>>>>>>>>>>> number of snapshots, changelog reads, and so on. It is a >>>>>>>>>>>>>>> project goal to >>>>>>>>>>>>>>> excel in streaming. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I was going to focus on equality deletes after completing >>>>>>>>>>>>>>> the DV work. 
I believe we have these options: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Revisit the existing design of equality deletes (e.g. add >>>>>>>>>>>>>>> more restrictions, improve compaction, offer new writers). >>>>>>>>>>>>>>> - Standardize on the view-based approach [1] to handle >>>>>>>>>>>>>>> streaming upserts and CDC use cases, potentially making this >>>>>>>>>>>>>>> part of the >>>>>>>>>>>>>>> spec. >>>>>>>>>>>>>>> - Add support for inverted indexes to reduce the cost of >>>>>>>>>>>>>>> position lookup. This is fairly tricky to implement for >>>>>>>>>>>>>>> streaming use cases >>>>>>>>>>>>>>> without an external system. Our runtime filtering in Spark >>>>>>>>>>>>>>> today is >>>>>>>>>>>>>>> equivalent to looking up positions in an inverted index >>>>>>>>>>>>>>> represented by >>>>>>>>>>>>>>> another Iceberg table. That may still not be enough for some >>>>>>>>>>>>>>> streaming use >>>>>>>>>>>>>>> cases. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Anton >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> чт, 31 жовт. 2024 р. о 21:31 Micah Kornfield < >>>>>>>>>>>>>>> emkornfi...@gmail.com> пише: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I agree that equality deletes have their place in >>>>>>>>>>>>>>>> streaming. I think the ultimate decision here is how >>>>>>>>>>>>>>>> opinionated >>>>>>>>>>>>>>>> Iceberg wants to be on its use-cases. If it really wants to >>>>>>>>>>>>>>>> stick to its >>>>>>>>>>>>>>>> origins of "slow moving data", then removing equality deletes >>>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>>> inline with this. I think the other high level question is >>>>>>>>>>>>>>>> how much we >>>>>>>>>>>>>>>> allow for partially compatible features (the row lineage >>>>>>>>>>>>>>>> use-case feature >>>>>>>>>>>>>>>> was explicitly approved excluding equality deletes, and people >>>>>>>>>>>>>>>> seemed OK >>>>>>>>>>>>>>>> with it at the time. 
If all features need to work together, >>>>>>>>>>>>>>>> then maybe we >>>>>>>>>>>>>>>> need to rethink the design here so it can be forward >>>>>>>>>>>>>>>> compatible with >>>>>>>>>>>>>>>> equality deletes). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think one issue with equality deletes as stated in the >>>>>>>>>>>>>>>> spec is that they are overly broad. I'd be interested if >>>>>>>>>>>>>>>> people have any >>>>>>>>>>>>>>>> use cases that differ, but I think one way of narrowing (and >>>>>>>>>>>>>>>> probably a >>>>>>>>>>>>>>>> necessary building block for building something better) the >>>>>>>>>>>>>>>> specification >>>>>>>>>>>>>>>> scope on equality deletes is to focus on upsert/Streaming >>>>>>>>>>>>>>>> deletes. Two >>>>>>>>>>>>>>>> proposals in this regard are: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1. Require that equality deletes can only correspond to >>>>>>>>>>>>>>>> unique identifiers for the table. >>>>>>>>>>>>>>>> 2. Consider requiring that for equality deletes on >>>>>>>>>>>>>>>> partitioned tables, that the primary key must contain a >>>>>>>>>>>>>>>> partition column (I >>>>>>>>>>>>>>>> believe Flink at least already does this). It is less clear >>>>>>>>>>>>>>>> to me that >>>>>>>>>>>>>>>> this would meet all existing use-cases. But having this would >>>>>>>>>>>>>>>> allow for >>>>>>>>>>>>>>>> better incremental data-structures, which could then be >>>>>>>>>>>>>>>> partition based. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Narrow scope to unique identifiers would allow for further >>>>>>>>>>>>>>>> building blocks already mentioned, like a secondary index >>>>>>>>>>>>>>>> (possible via LSM >>>>>>>>>>>>>>>> tree), that would allow for better performance overall. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I generally agree with the sentiment that we shouldn't >>>>>>>>>>>>>>>> deprecate them until there is a viable replacement. 
With all >>>>>>>>>>>>>>>> due respect >>>>>>>>>>>>>>>> to my employer, let's not fall into the Google trap [1] :) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> [1] https://goomics.net/50/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo < >>>>>>>>>>>>>>>> alex...@starburstdata.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Just to throw my 2 cents in, I agree with Steven and >>>>>>>>>>>>>>>>> others that we do need some kind of replacement before >>>>>>>>>>>>>>>>> deprecating equality >>>>>>>>>>>>>>>>> deletes. >>>>>>>>>>>>>>>>> They certainly have their problems, and do significantly >>>>>>>>>>>>>>>>> increase complexity as they are now, but the writing of >>>>>>>>>>>>>>>>> position deletes is >>>>>>>>>>>>>>>>> too expensive for certain pipelines. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We've been investigating using equality deletes for some >>>>>>>>>>>>>>>>> of our workloads at Starburst, the key advantage we were >>>>>>>>>>>>>>>>> hoping to leverage >>>>>>>>>>>>>>>>> is cheap, effectively random access lookup deletes. >>>>>>>>>>>>>>>>> Say you have a UUID column that's unique in a table and >>>>>>>>>>>>>>>>> want to delete a row by UUID. With position deletes each >>>>>>>>>>>>>>>>> delete is >>>>>>>>>>>>>>>>> expensive without an index on that UUID. >>>>>>>>>>>>>>>>> With equality deletes each delete is cheap and while >>>>>>>>>>>>>>>>> reads/compaction is expensive but when updates are frequent >>>>>>>>>>>>>>>>> and reads are >>>>>>>>>>>>>>>>> sporadic that's a reasonable tradeoff. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Pretty much what Jason and Steven have already said. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Maybe there are some incremental improvements on equality >>>>>>>>>>>>>>>>> deletes or tips from similar systems that might alleviate >>>>>>>>>>>>>>>>> some of their >>>>>>>>>>>>>>>>> problems? 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> - Alex Jo >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu < >>>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We probably all agree with the downside of equality >>>>>>>>>>>>>>>>>> deletes: it postpones all the work on the read path. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> In theory, we can implement position deletes only in the >>>>>>>>>>>>>>>>>> Flink streaming writer. It would require the tracking of >>>>>>>>>>>>>>>>>> last committed >>>>>>>>>>>>>>>>>> data files per key, which can be stored in Flink state >>>>>>>>>>>>>>>>>> (checkpointed). This >>>>>>>>>>>>>>>>>> is obviously quite expensive/challenging, but possible. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I like to echo one benefit of equality deletes that >>>>>>>>>>>>>>>>>> Russel called out in the original email. Equality deletes >>>>>>>>>>>>>>>>>> would never >>>>>>>>>>>>>>>>>> have conflicts. that is important for streaming writers >>>>>>>>>>>>>>>>>> (Flink, Kafka >>>>>>>>>>>>>>>>>> connect, ...) that commit frequently (minutes or less). >>>>>>>>>>>>>>>>>> Assume Flink can >>>>>>>>>>>>>>>>>> write position deletes only and commit every 2 minutes. The >>>>>>>>>>>>>>>>>> long-running >>>>>>>>>>>>>>>>>> nature of streaming jobs can cause frequent commit conflicts >>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>> background delete compaction jobs. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Overall, the streaming upsert write is not a well solved >>>>>>>>>>>>>>>>>> problem in Iceberg. This probably affects all streaming >>>>>>>>>>>>>>>>>> engines (Flink, >>>>>>>>>>>>>>>>>> Kafka connect, Spark streaming, ...). We need to come up >>>>>>>>>>>>>>>>>> with some better >>>>>>>>>>>>>>>>>> alternatives before we can deprecate equality deletes. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 8:38 AM Russell Spitzer < >>>>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> For users of Equality Deletes, what are the key >>>>>>>>>>>>>>>>>>> benefits to Equality Deletes that you would like to >>>>>>>>>>>>>>>>>>> preserve and could you >>>>>>>>>>>>>>>>>>> please share some concrete examples of the queries you want >>>>>>>>>>>>>>>>>>> to run (and the >>>>>>>>>>>>>>>>>>> schemas and data sizes you would like to run them against) >>>>>>>>>>>>>>>>>>> and the >>>>>>>>>>>>>>>>>>> latencies that would be acceptable? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 10:05 AM Jason Fine >>>>>>>>>>>>>>>>>>> <ja...@upsolver.com.invalid> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Representing Upsolver here, we also make use of >>>>>>>>>>>>>>>>>>>> Equality Deletes to deliver high frequency low latency >>>>>>>>>>>>>>>>>>>> updates to our >>>>>>>>>>>>>>>>>>>> clients at scale. We have customers using them at scale >>>>>>>>>>>>>>>>>>>> and demonstrating >>>>>>>>>>>>>>>>>>>> the need and viability. We automate the process of >>>>>>>>>>>>>>>>>>>> converting them into >>>>>>>>>>>>>>>>>>>> positional deletes (or fully applying them) for more >>>>>>>>>>>>>>>>>>>> efficient engine >>>>>>>>>>>>>>>>>>>> queries in the background giving our users both low >>>>>>>>>>>>>>>>>>>> latency and good query >>>>>>>>>>>>>>>>>>>> performance. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Equality Deletes were added since there isn't a good >>>>>>>>>>>>>>>>>>>> way to solve frequent updates otherwise. 
It would require >>>>>>>>>>>>>>>>>>>> some sort of >>>>>>>>>>>>>>>>>>>> index keeping track of every record in the table (by a >>>>>>>>>>>>>>>>>>>> predetermined PK) >>>>>>>>>>>>>>>>>>>> and maintaining such an index is a huge task that every >>>>>>>>>>>>>>>>>>>> tool interested in >>>>>>>>>>>>>>>>>>>> this would need to re-implement. It also becomes a >>>>>>>>>>>>>>>>>>>> bottleneck limiting >>>>>>>>>>>>>>>>>>>> table sizes. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I don't think they should be removed without providing >>>>>>>>>>>>>>>>>>>> an alternative. Positional Deletes have a different >>>>>>>>>>>>>>>>>>>> performance profile >>>>>>>>>>>>>>>>>>>> inherently, requiring more upfront work proportional to >>>>>>>>>>>>>>>>>>>> the table size. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 2:45 PM Jean-Baptiste Onofré < >>>>>>>>>>>>>>>>>>>> j...@nanthrax.net> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Russell >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks for the nice writeup and the proposal. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I agree with your analysis, and I have the same >>>>>>>>>>>>>>>>>>>>> feeling. However, I >>>>>>>>>>>>>>>>>>>>> think there are more than Flink that write equality >>>>>>>>>>>>>>>>>>>>> delete files. So, >>>>>>>>>>>>>>>>>>>>> I agree to deprecate in V3, but maybe be more >>>>>>>>>>>>>>>>>>>>> "flexible" about removal >>>>>>>>>>>>>>>>>>>>> in V4 in order to give time to engines to update. >>>>>>>>>>>>>>>>>>>>> I think that by deprecating equality deletes, we are >>>>>>>>>>>>>>>>>>>>> clearly focusing >>>>>>>>>>>>>>>>>>>>> on read performance and "consistency" (more than >>>>>>>>>>>>>>>>>>>>> write). 
It's not >>>>>>>>>>>>>>>>>>>>> necessarily a bad thing but the streaming platform and >>>>>>>>>>>>>>>>>>>>> data ingestion >>>>>>>>>>>>>>>>>>>>> platforms will be probably concerned about that (by >>>>>>>>>>>>>>>>>>>>> using positional >>>>>>>>>>>>>>>>>>>>> deletes, they will have to scan/read all datafiles to >>>>>>>>>>>>>>>>>>>>> find the >>>>>>>>>>>>>>>>>>>>> position, so painful). >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> So, to summarize: >>>>>>>>>>>>>>>>>>>>> 1. Agree to deprecate equality deletes, but -1 to >>>>>>>>>>>>>>>>>>>>> commit any target >>>>>>>>>>>>>>>>>>>>> for deletion before having a clear path for streaming >>>>>>>>>>>>>>>>>>>>> platforms >>>>>>>>>>>>>>>>>>>>> (Flink, Beam, ...) >>>>>>>>>>>>>>>>>>>>> 2. In the meantime (during the deprecation period), I >>>>>>>>>>>>>>>>>>>>> propose to >>>>>>>>>>>>>>>>>>>>> explore possible improvements for streaming platforms >>>>>>>>>>>>>>>>>>>>> (maybe finding a >>>>>>>>>>>>>>>>>>>>> way to avoid full data files scan, ...) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks ! >>>>>>>>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>>>>>>>> JB >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Wed, Oct 30, 2024 at 10:06 PM Russell Spitzer >>>>>>>>>>>>>>>>>>>>> <russell.spit...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Background: >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > 1) Position Deletes >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Writers determine what rows are deleted and mark >>>>>>>>>>>>>>>>>>>>> them in a 1 for 1 representation. With delete vectors >>>>>>>>>>>>>>>>>>>>> this means every data >>>>>>>>>>>>>>>>>>>>> file has at most 1 delete vector that it is read in >>>>>>>>>>>>>>>>>>>>> conjunction with to >>>>>>>>>>>>>>>>>>>>> excise deleted rows. Reader overhead is more or less >>>>>>>>>>>>>>>>>>>>> constant and is very >>>>>>>>>>>>>>>>>>>>> predictable. 
>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > The main cost of this mode is that deletes must be >>>>>>>>>>>>>>>>>>>>> determined at write time which is expensive and can be >>>>>>>>>>>>>>>>>>>>> more difficult for >>>>>>>>>>>>>>>>>>>>> conflict resolution >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > 2) Equality Deletes >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Writers write out reference to what values are >>>>>>>>>>>>>>>>>>>>> deleted (in a partition or globally). There can be an >>>>>>>>>>>>>>>>>>>>> unlimited number of >>>>>>>>>>>>>>>>>>>>> equality deletes and they all must be checked for every >>>>>>>>>>>>>>>>>>>>> data file that is >>>>>>>>>>>>>>>>>>>>> read. The cost of determining deleted rows is essentially >>>>>>>>>>>>>>>>>>>>> given to the >>>>>>>>>>>>>>>>>>>>> reader. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Conflicts almost never happen since data files are >>>>>>>>>>>>>>>>>>>>> not actually changed and there is almost no cost to the >>>>>>>>>>>>>>>>>>>>> writer to generate >>>>>>>>>>>>>>>>>>>>> these. Almost all costs related to equality deletes are >>>>>>>>>>>>>>>>>>>>> passed on to the >>>>>>>>>>>>>>>>>>>>> reader. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Proposal: >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Equality deletes are, in my opinion, unsustainable >>>>>>>>>>>>>>>>>>>>> and we should work on deprecating and removing them from >>>>>>>>>>>>>>>>>>>>> the specification. >>>>>>>>>>>>>>>>>>>>> At this time, I know of only one engine (Apache Flink) >>>>>>>>>>>>>>>>>>>>> which produces these >>>>>>>>>>>>>>>>>>>>> deletes but almost all engines have implementations to >>>>>>>>>>>>>>>>>>>>> read them. The cost >>>>>>>>>>>>>>>>>>>>> of implementing equality deletes on the read path is >>>>>>>>>>>>>>>>>>>>> difficult and >>>>>>>>>>>>>>>>>>>>> unpredictable in terms of memory usage and compute >>>>>>>>>>>>>>>>>>>>> complexity. 
We’ve had >>>>>>>>>>>>>>>>>>>>> suggestions of implementing rocksdb inorder to handle >>>>>>>>>>>>>>>>>>>>> ever growing sets of >>>>>>>>>>>>>>>>>>>>> equality deletes which in my opinion shows that we are >>>>>>>>>>>>>>>>>>>>> going down the wrong >>>>>>>>>>>>>>>>>>>>> path. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Outside of performance, Equality deletes are also >>>>>>>>>>>>>>>>>>>>> difficult to use in conjunction with many other features. >>>>>>>>>>>>>>>>>>>>> For example, any >>>>>>>>>>>>>>>>>>>>> features requiring CDC or Row lineage are basically >>>>>>>>>>>>>>>>>>>>> impossible when >>>>>>>>>>>>>>>>>>>>> equality deletes are in use. When Equality deletes are >>>>>>>>>>>>>>>>>>>>> present, the state >>>>>>>>>>>>>>>>>>>>> of the table can only be determined with a full scan >>>>>>>>>>>>>>>>>>>>> making it difficult to >>>>>>>>>>>>>>>>>>>>> update differential structures. This means materialized >>>>>>>>>>>>>>>>>>>>> views or indexes >>>>>>>>>>>>>>>>>>>>> need to essentially be fully rebuilt whenever an equality >>>>>>>>>>>>>>>>>>>>> delete is added >>>>>>>>>>>>>>>>>>>>> to the table. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Equality deletes essentially remove complexity from >>>>>>>>>>>>>>>>>>>>> the write side but then add what I believe is an >>>>>>>>>>>>>>>>>>>>> unacceptable level of >>>>>>>>>>>>>>>>>>>>> complexity to the read side. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Because of this I suggest we deprecate Equality >>>>>>>>>>>>>>>>>>>>> Deletes in V3 and slate them for full removal from the >>>>>>>>>>>>>>>>>>>>> Iceberg Spec in V4. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > I know this is a big change and compatibility >>>>>>>>>>>>>>>>>>>>> breakage so I would like to introduce this idea to the >>>>>>>>>>>>>>>>>>>>> community and >>>>>>>>>>>>>>>>>>>>> solicit feedback from all stakeholders. 
I am very >>>>>>>>>>>>>>>>>>>>> flexible on this issue >>>>>>>>>>>>>>>>>>>>> and would like to hear the best issues both for and >>>>>>>>>>>>>>>>>>>>> against removal of >>>>>>>>>>>>>>>>>>>>> Equality Deletes. >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Thanks everyone for your time, >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > Russ Spitzer >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> *Jason Fine* >>>>>>>>>>>>>>>>>>>> Chief Software Architect >>>>>>>>>>>>>>>>>>>> ja...@upsolver.com | www.upsolver.com >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>