Good points Micah. A few additional points:
- Russell mentioned partial updates as a possible feature.
- I would like it to be possible to read the current content of the table (with all of the committed changes) without an engine: no complicated joins, no large amounts of data held in memory, etc.
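For context, a minimal sketch of what "reading the table without an engine" already looks like with Iceberg's generic reader (the table location is a placeholder; loading through a catalog would work the same way):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.data.IcebergGenerics;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.io.CloseableIterable;

    public class PlainRead {
        public static void main(String[] args) throws Exception {
            // Load the table straight from its location -- no query engine involved.
            Table table = new HadoopTables(new Configuration()).load("s3://bucket/warehouse/db/tbl");

            // The generic reader applies committed delete files while scanning,
            // so it yields only the rows that are still live in the table.
            try (CloseableIterable<Record> rows = IcebergGenerics.read(table).build()) {
                rows.forEach(System.out::println);
            }
        }
    }

Whatever replaces equality deletes should keep this property: a reader with no join machinery can still materialize the current table state.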
Micah Kornfield <emkornfi...@gmail.com> wrote (Wed, Nov 20, 2024, 0:56):

>> The key here is that you only use position deletes on your delta table which you keep small, say 1gb or less.
>
> Would this cause issues operationally for either a high enough sustained throughput of streamed data, or if the maintenance process of moving the data out of the delta table has an outage?
>
>> - How would the solution handle the double updates?
>> All updates to the delta table are merge/upserts on primary keys. Multiple updates are just like normal updates on a small table. Scan and add position deletes.
>
> I'm not sure if this is what Peter is asking, but I think position deletes effectively require that updates for a single row be isolated in a single snapshot? This seems like it is potentially difficult to implement in a distributed context.
>
> I think Jack's document [1] does a good job of outlining a lot of the challenges/trade-offs. Before searching for a specific solution it might be worth agreeing on the requirements for upsert/CDC use-cases. Off the top of my head I suggest a few (these are open to debate and not complete):
>
> 1. Insert latency for the table is O(data inserted) and does not depend on the amount of data already in the table.
> 2. Writers should be able to write multiple row mutations in a single snapshot (with some form of additional disambiguator to identify ordering).
> 3. Upserts require a table that has a primary key.
> 4. Readers should be able to trade off query costs vs data freshness (e.g. like BigQuery's max-staleness <https://cloud.google.com/bigquery/docs/change-data-capture> [2] or Hudi's query types <https://hudi.apache.org/docs/next/table_types/#query-types> [3]) and inspect the tables to understand the trade-off.
> 5. Upsert operations should be compatible with row identifiers.
>
> Thanks,
> Micah
>
> [1] https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0
> [2] https://cloud.google.com/bigquery/docs/change-data-capture
> [3] https://hudi.apache.org/docs/next/table_types/#query-types
>
> On Tue, Nov 19, 2024 at 10:21 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> - What would the Delta table look like?
>> The Delta Table is just another Iceberg table with the exact same schema as the base table (it could possibly skip partitioning since we expect it to stay very small).
>>
>> - Would it just contain the whole new record?
>> It could; it doesn't have to. The key is that any values in the Delta table are selected over any records in the base table.
>>
>> - How would the solution handle the double updates?
>> All updates to the delta table are merge/upserts on primary keys. Multiple updates are just like normal updates on a small table. Scan and add position deletes.
>>
>> - Would it just write a second version of the record to the Delta table?
>> No (at least not in my plan).
>>
>> - How would the solution handle Deletes?
>> This is debatable. We can add a tombstone cell to the schema (i.e. if this cell is set, ignore all values in base and remove the key).
>>
>> - Special row, or marker for the deleted id?
>>
>>> If we add a solution for all of this complexity, I'm afraid that we arrive at a solution which is very similar to the current one, especially if we consider that some readers need to do a single-step read (reading a data file and emitting only the records which are still in the table).
>>>
>>> At best, this solution removes the need for the equality delete, but then single-step readers of the old data files need to read all of the new records, to ensure that they don't emit already updated data. Which is worse than reading small equality delete files.
>>
>> The key here is that you only use position deletes on your delta table, which you keep small, say 1gb or less. Engines can (if they like) cache this information and determine position deletes very easily. The key again here is that we have no need for any equality deletes, and the rules for resolving updated rows are much stricter and simpler than those for equality deletes. We could also allow readers who only want to do a single-step read to do a "last merged" read which just checks the base table. Again this is better than the current situation, since this would be constant time rather than scaling with the equality deletes.
>>
>> On Tue, Nov 19, 2024 at 11:46 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I have a few questions about the Delta table:
>>> - What would the Delta table look like?
>>> - Would it just contain the whole new record?
>>>
>>> - How would the solution handle the double updates?
>>> - Would it just write a second version of the record to the Delta table?
>>>
>>> - How would the solution handle Deletes?
>>> - Special row, or marker for the deleted id?
>>>
>>> If we add a solution for all of this complexity, I'm afraid that we arrive at a solution which is very similar to the current one, especially if we consider that some readers need to do a single-step read (reading a data file and emitting only the records which are still in the table).
>>>
>>> At best, this solution removes the need for the equality delete, but then single-step readers of the old data files need to read all of the new records, to ensure that they don't emit already updated data. Which is worse than reading small equality delete files.
>>>
>>> Equality deletes could also be considered as a table which should be removed from the table containing the old data.
>>>
>>> Thanks, Peter
>>>
>>> On Tue, Nov 19, 2024, 17:46 Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> The proposal sounds similar to the Delta Lake CDC feature, with its CDC file type [1] and CDC action [2].
>>>>
>>>> There was also the proposal I wrote a long time ago [3] to use a "cdc" branch rather than 2 private tables, which was inspired by the Delta Lake approach. The feedback was mixed at that time: on one side, at least the user does not need to have a 2-table setup and it is still considered doing CDC against one Iceberg table; on the other side, the branching construct was not widely supported by enterprise-level ETL and governance features, and having 2 tables might just be cleaner after all. But we have seen customers implementing the cdc branch approach in the proposal, and it was successful.
>>>>
>>>> Either way, in a Delta table based CDC approach, for the reader, we could choose the view approach Russell described above, or develop a reader that does essentially a broadcast join at scan level.
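To make the view approach concrete, here is a minimal sketch of a merged read over a hypothetical base/delta pair, assuming both tables have columns (id, val), the delta table adds a boolean tombstone column, and id is the primary key (Spark SQL via the Java API; all table and column names are illustrative, not a proposed spec):

    import org.apache.spark.sql.SparkSession;

    public class MergedReadView {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("merged-read").getOrCreate();

            // Rows in delta win over rows in base; tombstoned keys disappear entirely.
            spark.sql(
                "CREATE OR REPLACE TEMPORARY VIEW merged AS "
                    + "SELECT id, val FROM delta WHERE NOT tombstone "
                    + "UNION ALL "
                    + "SELECT b.id, b.val FROM base b LEFT ANTI JOIN delta d ON b.id = d.id");

            spark.sql("SELECT * FROM merged").show();
        }
    }

The anti-join is what keeps this cheap: as long as the delta table stays small it can be broadcast, which is essentially the "broadcast join at scan level" reader mentioned above.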
>>>>
>>>> Overall, I think we have a lot of options on the table to solve CDC in Iceberg for both read and write.
>>>>
>>>> Given the row lineage feature is fundamentally in conflict with equality deletes, I would +1 for dropping equality delete support.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> [1] https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-data-files
>>>> [2] https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file
>>>> [3] https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0
>>>>
>>>> On Tue, Nov 19, 2024 at 7:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I'm strongly in favor of moving to the Delta + Base table approach discussed in the cookbook above. I wonder if we should codify that into something more standardized, but it seems to me to be a much better path forward. I'm not sure we need to support this at the spec level, but it would be nice if we could provide a table that was automatically broken into sub-tables and had well-defined operations on it.
>>>>>
>>>>> For example:
>>>>>
>>>>> FastUpdateTable:
>>>>>   Requires:
>>>>>     Primary Key Columns
>>>>>     Long Max Delta Size
>>>>>   Contains:
>>>>>     Private Iceberg Table: Delta
>>>>>     Private Iceberg Table: Base
>>>>>
>>>>>   On All Scans -
>>>>>     Return a view which joins delta and base on primary key; if Delta has a record for a given primary key, discard the base record.
>>>>>
>>>>>   On All Writes -
>>>>>     Perform all writes against the delta table; only MERGE is allowed. Append is forbidden (no PK guarantees). Only position deletes are allowed.
>>>>>
>>>>>   On Delta Table Size > Max Delta Size -
>>>>>     Upsert DELTA into BASE
>>>>>     Clear upserted records from Delta
>>>>>
>>>>> If the Delta Table size is kept small, I think this would be almost as performant as equality deletes but still be compatible with row lineage and other indexing features.
>>>>>
>>>>> On Tue, Nov 19, 2024 at 7:12 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ajantha,
>>>>>>
>>>>>> I'm proposing exploring a view-based approach similar to the changelog-mirror table pattern [1] rather than supporting delta writers for the Kafka Connect Iceberg sink.
>>>>>>
>>>>>> 1. https://www.tabular.io/apache-iceberg-cookbook/data-engineering-cdc-table-mirroring/
>>>>>>
>>>>>> On Tue, Nov 19, 2024 at 7:38 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> I don't think it's a problem while an alternative is explored (the JDK itself does that pretty often). So it's up to the community: of course I'm against removing it without a solid alternative, but deprecation is fine imho.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Tue, Nov 19, 2024 at 12:19, Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>
>>>>>>>>> - ok to deprecate equality deletes
>>>>>>>>> - not ok to remove it
>>>>>>>>
>>>>>>>> @JB: I don't think it is a good idea to use deprecated functionality in new feature development. Hence, my specific question was about the Kafka Connect upsert operation.
>>>>>>>>
>>>>>>>> @Manu: I meant the delta writers for the Kafka Connect Iceberg sink (which in turn are used for upserting the CDC records): https://github.com/apache/iceberg/issues/10842
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>> On Tue, Nov 19, 2024 at 3:08 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I second Anton's proposal to standardize on a view-based approach to handle CDC cases. Actually, it's already been explored in detail [1] by Jack before.
>>>>>>>>>
>>>>>>>>> [1] Improving Change Data Capture Use Case for Apache Iceberg <https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0#heading=h.94xnx4qg3bnt>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 19, 2024 at 4:16 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> My proposal is the following (already expressed):
>>>>>>>>>> - ok to deprecate equality deletes
>>>>>>>>>> - not ok to remove it
>>>>>>>>>> - work on position delete improvements to address streaming use cases. I think we should explore different approaches. Personally I think a possible approach would be to find a way to index data files to avoid a full scan to find the row position.
>>>>>>>>>>
>>>>>>>>>> My $0.01 :)
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 19, 2024 at 07:53, Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, What's the conclusion on this thread?
>>>>>>>>>>>
>>>>>>>>>>> Users are looking for Upsert (CDC) support for the OSS Iceberg Kafka Connect sink. We only support appends at the moment. Can we go ahead and implement the upserts using equality deletes?
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Nov 10, 2024 at 11:56 AM Vignesh <vignesh.v...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am reading about Iceberg and am quite new to this. This puffin would be an index from key to data file. Other use cases of Puffin, such as statistics, are at a per-file level if I understand correctly.
>>>>>>>>>>>>
>>>>>>>>>>>> Where would the puffin about key->data file be stored? It is a property of the entire table.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Vignesh.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Nov 9, 2024 at 2:17 AM Shani Elharrar <sh...@upsolver.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> JB, this is what we do: we write Equality Deletes and periodically convert them to Positional Deletes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We could probably index the keys, maybe partially index using bloom filters; the best would be to put those bloom filters inside Puffin.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shani.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 9 Nov 2024, at 11:11, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with Peter here, and I would say that it would be an issue for multi-engine support.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think, as I already mentioned with others, we should explore an alternative.
>>>>>>>>>>>>>> As the main issue is the data file scan in a streaming context, maybe we could find a way to "index"/correlate positional deletes with limited scanning. I will think again about that :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Nov 9, 2024 at 6:48 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Imran,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think it's a good idea to start creating multiple types of Iceberg tables. Iceberg's main selling point is compatibility between engines. If we don't have readers and writers for all types of tables, then we remove compatibility from the equation and engine-specific formats always win. OTOH, if we write readers and writers for all types of tables, then we are back to square one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Identifier fields are a table schema concept and are used in many cases during query planning and execution. This is why they are defined as part of the SQL spec, and this is why Iceberg defines them as well. One use case is where they can be used to merge deletes (independently of how they are manifested) and subsequent inserts into updates.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Flink SQL doesn't allow creating tables with partition transforms, so no new table could be created by Flink SQL using transforms, but tables created by other engines could still be used (both read and write). Also, you can create such tables in Flink using the Java API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Requiring partition columns to be part of the identifier fields comes from the practical consideration that you want to limit the scope of the equality deletes as much as possible. Otherwise all of the equality deletes would have to be table-global, and they would have to be read by every reader. We could write those; we just decided that we don't want to allow the user to do this, as it is in most cases a bad idea.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I hope this helps,
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 8, 2024, 22:01 Imran Rashid <iras...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not down in the weeds at all myself on implementation details, so forgive me if I'm wrong about the details here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can see all the viewpoints -- both that equality deletes enable some use cases, but also make others far more difficult. What surprised me the most is that Iceberg does not provide a way to distinguish these two table "types".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> At first, I thought the presence of an identifier-field (https://iceberg.apache.org/spec/#identifier-field-ids) indicated that the table was a target for equality deletes.
>>>>>>>>>>>>>>>> But then it turns out identifier-fields are also useful for changelog views even without equality deletes -- IIUC, they show that a delete + insert should actually be interpreted as an update in a changelog view.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To be perfectly honest, I'm confused about all of these details -- from my read, the spec does not indicate this relationship between identifier-fields and equality_ids in equality delete files (https://iceberg.apache.org/spec/#equality-delete-files), but I think that is the way Flink works. Flink itself seems to have even more limitations -- no partition transforms are allowed, and all partition columns must be a subset of the identifier fields. Is that just a Flink limitation, or is that the intended behavior in the spec? (Or maybe user-error on my part?) Those seem like very reasonable limitations, from an implementation point-of-view. But OTOH, as a user, this seems to be directly contrary to some of the promises of Iceberg.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's easy to see if a table already has equality deletes in it by looking at the metadata. But is there any way to indicate that a table (or branch of a table) _must not_ have equality deletes added to it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If that were possible, it seems like we could support both use cases. We could continue to optimize for the streaming ingestion use cases using equality deletes. But we could also build more optimizations into the "non-streaming-ingestion" branches. And we could document the tradeoff so it is much clearer to end users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To maintain compatibility, I suppose that the change would be that equality deletes continue to be allowed by default, but we'd add a new field to indicate that for some tables (or branches of a table), equality deletes would not be allowed. And it would be an error for an engine to make an update which added an equality delete to such a table.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe that change would even be possible in V3.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And if all the performance improvements to equality deletes make this a moot point, we could drop the field in v4. But it seems like a mistake to both limit the non-streaming use-case AND have confusing limitations for the end-user in the meantime.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would happily be corrected about my understanding of all of the above.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> thanks!
>>>>>>>>>>>>>>>> Imran
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 5, 2024 at 9:16 AM Bryan Keller <brya...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also feel we should keep equality deletes until we have an alternative solution for streaming updates/deletes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Bryan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Nov 4, 2024, at 8:33 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Well, it seems like I'm a little late, so most of the arguments are voiced.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree that we should not deprecate the equality deletes until we have a replacement feature. I think one of the big advantages of Iceberg is that it supports batch processing and streaming ingestion too. For streaming ingestion we need a way to update existing data in a performant way, but restricting deletes to the primary keys seems like enough from the streaming perspective.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Equality deletes allow a very wide range of applications, which we might be able to narrow down a bit but still keep useful. So if we want to go down this road, we need to start collecting the requirements.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Shani Elharrar <sh...@upsolver.com.invalid> wrote (Fri, Nov 1, 2024, 19:22):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I understand how it makes sense for batch jobs, but it damages streaming jobs. Using equality deletes works much better for streaming (which has strict SLAs for delays), and in order to decrease the performance penalty, systems can rewrite the equality deletes to positional deletes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Shani.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 1 Nov 2024, at 20:06, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Fundamentally, it is very difficult to write position deletes with concurrent writers and conflicts for batch jobs too, as the inverted index may become invalid/stale.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The position deletes are created during the write phase. But conflicts are only detected at the commit stage. I assume the batch job should fail in this case.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 10:57 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Shani,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> That is a good point. It is certainly a limitation for the Flink job to track the inverted index internally (which is what I had in mind). It can't be shared/synchronized with other Flink jobs or other engines writing to the same table.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar <sh...@upsolver.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Even if Flink can create this state, it would have to be maintained against the Iceberg table; we wouldn't want duplicate keys if other systems / users update the table (e.g. manual inserts / updates using DML).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Shani.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 1 Nov 2024, at 18:32, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Add support for inverted indexes to reduce the cost of position lookup. This is fairly tricky to implement for streaming use cases without an external system.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Anton, that is also what I was saying earlier. In Flink, the inverted index of (key, committed data files) can be tracked in Flink state.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I was a bit skeptical when we were adding equality deletes, but nothing beats their performance during writes. We have to find an alternative before deprecating.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> We are doing a lot of work to improve streaming, like reducing the cost of commits, enabling a large (potentially infinite) number of snapshots, changelog reads, and so on. It is a project goal to excel in streaming.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I was going to focus on equality deletes after completing the DV work. I believe we have these options:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Revisit the existing design of equality deletes (e.g. add more restrictions, improve compaction, offer new writers).
>>>>>>>>>>>>>>>>>>>>>>>> - Standardize on the view-based approach [1] to handle streaming upserts and CDC use cases, potentially making this part of the spec.
>>>>>>>>>>>>>>>>>>>>>>>> - Add support for inverted indexes to reduce the cost of position lookup. This is fairly tricky to implement for streaming use cases without an external system. Our runtime filtering in Spark today is equivalent to looking up positions in an inverted index represented by another Iceberg table. That may still not be enough for some streaming use cases.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 21:31, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I agree that equality deletes have their place in streaming. I think the ultimate decision here is how opinionated Iceberg wants to be on its use-cases. If it really wants to stick to its origins of "slow moving data", then removing equality deletes would be in line with this. I think the other high-level question is how much we allow for partially compatible features (the row lineage feature was explicitly approved excluding equality deletes, and people seemed OK with it at the time. If all features need to work together, then maybe we need to rethink the design here so it can be forward compatible with equality deletes).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I think one issue with equality deletes as stated in the spec is that they are overly broad. I'd be interested if people have any use cases that differ, but I think one way of narrowing the specification's scope on equality deletes (and probably a necessary building block for building something better) is to focus on upsert/streaming deletes. Two proposals in this regard are:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1. Require that equality deletes can only correspond to unique identifiers for the table.
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Consider requiring that, for equality deletes on partitioned tables, the primary key must contain a partition column (I believe Flink at least already does this). It is less clear to me that this would meet all existing use-cases. But having this would allow for better incremental data structures, which could then be partition based.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Narrowing the scope to unique identifiers would allow for the further building blocks already mentioned, like a secondary index (possible via LSM tree), which would allow for better performance overall.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I generally agree with the sentiment that we shouldn't deprecate them until there is a viable replacement.
>>>>>>>>>>>>>>>>>>>>>>>>> With all due respect to my employer, let's not fall into the Google trap [1] :)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://goomics.net/50/
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo <alex...@starburstdata.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Just to throw my 2 cents in, I agree with Steven and others that we do need some kind of replacement before deprecating equality deletes. They certainly have their problems, and do significantly increase complexity as they are now, but the writing of position deletes is too expensive for certain pipelines.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> We've been investigating using equality deletes for some of our workloads at Starburst; the key advantage we were hoping to leverage is cheap, effectively random-access lookup deletes. Say you have a UUID column that's unique in a table and want to delete a row by UUID. With position deletes each delete is expensive without an index on that UUID. With equality deletes each delete is cheap while reads/compaction are expensive, but when updates are frequent and reads are sporadic that's a reasonable tradeoff.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Pretty much what Jason and Steven have already said.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe there are some incremental improvements on equality deletes, or tips from similar systems, that might alleviate some of their problems?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> - Alex Jo
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> We probably all agree with the downside of equality deletes: it postpones all the work to the read path.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> In theory, we could implement position deletes only in the Flink streaming writer. It would require tracking the last committed data files per key, which can be stored in Flink state (checkpointed). This is obviously quite expensive/challenging, but possible.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd like to echo one benefit of equality deletes that Russell called out in the original email: equality deletes would never have conflicts. That is important for streaming writers (Flink, Kafka Connect, ...) that commit frequently (minutes or less).
>>>>>>>>>>>>>>>>>>>>>>>>>>> Assume Flink can write position deletes only and commit every 2 minutes. The long-running nature of streaming jobs can cause frequent commit conflicts with background delete compaction jobs.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Overall, the streaming upsert write is not a well-solved problem in Iceberg. This probably affects all streaming engines (Flink, Kafka Connect, Spark streaming, ...). We need to come up with some better alternatives before we can deprecate equality deletes.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 8:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> For users of Equality Deletes, what are the key benefits of Equality Deletes that you would like to preserve? Could you please share some concrete examples of the queries you want to run (and the schemas and data sizes you would like to run them against) and the latencies that would be acceptable?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 10:05 AM Jason Fine <ja...@upsolver.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Representing Upsolver here, we also make use of Equality Deletes to deliver high-frequency, low-latency updates to our clients at scale. We have customers using them at scale and demonstrating the need and viability. We automate the process of converting them into positional deletes (or fully applying them) for more efficient engine queries in the background, giving our users both low latency and good query performance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Equality Deletes were added since there isn't a good way to solve frequent updates otherwise. It would require some sort of index keeping track of every record in the table (by a predetermined PK), and maintaining such an index is a huge task that every tool interested in this would need to re-implement. It also becomes a bottleneck limiting table sizes.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think they should be removed without providing an alternative. Positional Deletes inherently have a different performance profile, requiring more upfront work proportional to the table size.
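To make the "per-key index" idea discussed above concrete, here is a minimal sketch of a keyed Flink operator that remembers the last written (data file, row position) for each primary key, so an upsert can emit a position delete instead of an equality delete. This is a sketch under simplifying assumptions, not an existing Iceberg sink API: the input pairs the primary key with the location the writer assigned to the new row, whereas a real design would only get that feedback after the Iceberg commit:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Input: (primaryKey, locationOfNewRow); output: write actions as plain strings.
    public class PositionTrackingUpserter
            extends KeyedProcessFunction<String, Tuple2<String, String>, String> {

        // Last written "<data-file>:<row-position>" for the current key,
        // checkpointed along with the rest of the Flink state.
        private transient ValueState<String> lastLocation;

        @Override
        public void open(Configuration parameters) {
            lastLocation = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-location", String.class));
        }

        @Override
        public void processElement(Tuple2<String, String> upsert,
                                   Context ctx,
                                   Collector<String> out) throws Exception {
            String previous = lastLocation.value();
            if (previous != null) {
                // The key was written before: delete the old copy by position,
                // so no equality delete is ever needed.
                out.collect("POSITION-DELETE " + previous);
            }
            out.collect("INSERT key=" + upsert.f0);
            lastLocation.update(upsert.f1); // remember where the new row now lives
        }
    }

The limitations called out in the thread show up directly: this state belongs to one Flink job, grows with the number of live keys, and goes stale as soon as another writer or a compaction rewrites the underlying files.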
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 2:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Russell
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the nice writeup and the proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with your analysis, and I have the same feeling. However, I think there are more engines than Flink that write equality delete files. So, I agree to deprecate in V3, but maybe be more "flexible" about removal in V4 in order to give engines time to update. I think that by deprecating equality deletes, we are clearly focusing on read performance and "consistency" (more than write). That's not necessarily a bad thing, but streaming and data ingestion platforms will probably be concerned about it (by using positional deletes, they will have to scan/read all data files to find the position, which is painful).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So, to summarize:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Agree to deprecate equality deletes, but -1 to commit to any target for deletion before having a clear path for streaming platforms (Flink, Beam, ...).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. In the meantime (during the deprecation period), I propose to explore possible improvements for streaming platforms (maybe finding a way to avoid full data file scans, ...).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks !
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 30, 2024 at 10:06 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Background:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Position Deletes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Writers determine what rows are deleted and mark them in a 1-for-1 representation. With delete vectors this means every data file has at most one delete vector, which is read in conjunction with the file to excise deleted rows. Reader overhead is more or less constant and is very predictable.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The main cost of this mode is that deletes must be determined at write time, which is expensive and can be more difficult for conflict resolution.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Equality Deletes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Writers write out a reference to what values are deleted (in a partition or globally). There can be an unlimited number of equality deletes, and they all must be checked for every data file that is read. The cost of determining deleted rows is essentially given to the reader.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Conflicts almost never happen since data files are not actually changed, and there is almost no cost to the writer to generate these. Almost all costs related to equality deletes are passed on to the reader.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Proposal:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Equality deletes are, in my opinion, unsustainable and we should work on deprecating and removing them from the specification. At this time, I know of only one engine (Apache Flink) which produces these deletes, but almost all engines have implementations to read them. The cost of implementing equality deletes on the read path is difficult and unpredictable in terms of memory usage and compute complexity. We've had suggestions of implementing RocksDB in order to handle ever-growing sets of equality deletes, which in my opinion shows that we are going down the wrong path.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Outside of performance, equality deletes are also difficult to use in conjunction with many other features. For example, any features requiring CDC or row lineage are basically impossible when equality deletes are in use. When equality deletes are present, the state of the table can only be determined with a full scan, making it difficult to update differential structures. This means materialized views or indexes need to essentially be fully rebuilt whenever an equality delete is added to the table.
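A toy illustration of the asymmetry described above, under the simplifying assumption of in-memory rows and arbitrary delete predicates (hypothetical types; real readers organize delete state per scan task, but the shape of the cost is the same): applying a position-delete vector is one bitmap membership test per row, while equality deletes mean holding every accumulated delete predicate and testing each row against all of them.

    import java.util.BitSet;
    import java.util.List;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    public class DeleteApplication {
        record Row(int pos, String key) {}

        // Position deletes / delete vectors: one bitmap per data file,
        // a constant-time membership test per row.
        static List<Row> applyPositionDeletes(List<Row> dataFile, BitSet deletedPositions) {
            return dataFile.stream()
                    .filter(r -> !deletedPositions.get(r.pos()))
                    .collect(Collectors.toList());
        }

        // Equality deletes: every accumulated delete predicate must be held by the
        // reader and tested against every row, so read cost grows with the number
        // of deletes rather than staying constant.
        static List<Row> applyEqualityDeletes(List<Row> dataFile, List<Predicate<Row>> deletes) {
            return dataFile.stream()
                    .filter(r -> deletes.stream().noneMatch(d -> d.test(r)))
                    .collect(Collectors.toList());
        }
    }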
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Equality deletes essentially remove complexity from the write side but then add what I believe is an unacceptable level of complexity to the read side.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Because of this I suggest we deprecate Equality Deletes in V3 and slate them for full removal from the Iceberg Spec in V4.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I know this is a big change and a compatibility break, so I would like to introduce this idea to the community and solicit feedback from all stakeholders. I am very flexible on this issue and would like to hear the best arguments both for and against removal of Equality Deletes.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks everyone for your time,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Russ Spitzer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jason Fine
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Chief Software Architect
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ja...@upsolver.com | www.upsolver.com