Good points, Micah,

A few additional points:
- Russell mentioned partial updates as a possible feature
- I would like it to remain possible to read the current content of
the table (with all of the committed changes) without an engine - no
complicated joins, no large amounts of data held in memory, etc.



Micah Kornfield <emkornfi...@gmail.com> wrote (on Wed, Nov 20, 2024 at
0:56):

> The key here is that you only use position deletes on your delta table
>> which you keep small, say 1gb or less.
>
>
> Would this cause operational issues, either under a high enough sustained
> throughput of streamed data, or if the maintenance process of moving the
> data out of the delta table has an outage?
>
> - How would the solution handle the double updates?
>> All updates to the delta table are merge/upserts on primary keys.
>> Multiple updates are just like normal updates on a small table. Scan and
>> add position deletes
>
>
> I'm not sure if this is what Peter is asking, but I think position deletes
> effectively require that updates for a single row be isolated in a single
> snapshot?  This seems like it is potentially difficult to implement in a
> distributed context.
>
> I think Jack's document [1] does a good job outlining a
> lot of the challenges/trade-offs.  Before searching for a specific solution
> it might be worth agreeing on the requirements for upsert/CDC use-cases.
> Off the top of my head I suggest a few (these are open to debate and not
> complete):
>
> 1.  Insert latency for the table is O(data inserted) and does not depend on
> the amount of data already in the table.
> 2.  Writers should be able to write multiple row mutations in a single
> snapshot (with some form of additional disambiguator to identify ordering).
> 3.  Upserts require a table that has a primary key.
> 4.  Readers should be able to trade off query costs vs data freshness
> (e.g. like BigQuery's max-staleness
> <https://cloud.google.com/bigquery/docs/change-data-capture> [2] or
> Hudi's query types
> <https://hudi.apache.org/docs/next/table_types/#query-types> [3]) and
> inspect the tables to understand the trade-off (see the sketch after this
> list).
> 5.  Upsert operations should be compatible with Row identifiers.
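>
> To make requirement 4 concrete, here is roughly how BigQuery exposes that
> trade-off [2] (max_staleness is the option from the linked docs; the
> dataset, table, and interval here are hypothetical):
>
>   -- Reads may serve results up to 15 minutes stale in exchange for
>   -- cheaper queries; a smaller interval buys freshness at higher cost.
>   ALTER TABLE mydataset.orders
>   SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);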
>
> Thanks,
> Micah
>
> [1]
> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0
> [2] https://cloud.google.com/bigquery/docs/change-data-capture
> [3] https://hudi.apache.org/docs/next/table_types/#query-types
>
>
>
> On Tue, Nov 19, 2024 at 10:21 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> - What would the Delta table look like?
>> The Delta table is just another Iceberg table with the exact same schema as
>> the base table (it could possibly skip partitioning since we expect it to
>> stay very small)
>>
>> - Would it just contain the whole new record?
>> It could; it doesn't have to. The key is that any values in the Delta table
>> are selected over any records in the base table.
>>
>> - How would the solution handle the double updates?
>> All updates to the delta table are merge/upserts on primary keys.
>> Multiple updates are just like normal updates on a small table. Scan and
>> add position deletes
>>
>> - Would it just write a second version of the record to the Delta table?
>> No (at least not in my plan)
>>
>> - How would the solution handle Deletes?
>> This is debatable. We can add a tombstone cell to the schema (i.e. if this
>> cell is set, ignore all values in base and remove the key)
>>
>> - Special row, or marker for the deleted id?
>>
>>
>> If we add a solution for all of these complexities, I'm afraid that we
>> arrive at a solution which is very similar to the current one, especially if
>> we consider that some readers need to do a single step read (reading a data
>> file and emitting only the records which are still in the table).
>>
>> At best, this solution removes the need for the equality delete, but then
>> single step readers of the old data files need to read all of the new
>> records, to ensure that they don't emit already updated data. Which is
>> worse than reading small equality delete files.
>>
>>
>> The key here is that you only use position deletes on your delta table
>> which you keep small, say 1gb or less. Engines can (if they like) cache
>> this information and determine position deletes very easily. The key again
>> here is that we have no need for any equality deletes and the rules for
>> resolving updated rows are much more strict and easier than equality
>> deletes.  We could also allow readers who only want to do a single step
>> read to do a "last merged" read which just checks the base table. Again
>> this is better than the current situation since this would be constant time
>> rather than scaling with equality deletes.
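>>
>> To make the merged read concrete, a minimal sketch in SQL, assuming a
>> single primary-key column "id" (the table and column names are made up;
>> this shows the pattern, not a worked-out design):
>>
>>   -- A row in delta shadows the row with the same key in base.
>>   CREATE VIEW merged AS
>>   SELECT * FROM delta
>>   UNION ALL
>>   SELECT b.* FROM base b
>>   WHERE NOT EXISTS (SELECT 1 FROM delta d WHERE d.id = b.id);
>>
>> A "last merged" read is then simply SELECT * FROM base.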
>>
>> On Tue, Nov 19, 2024 at 11:46 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Team,
>>>
>>> I have a few questions about the Delta table:
>>> - What would the Delta table look like?
>>> - Would it just contain the whole new record?
>>>
>>> - How would the solution handle the double updates?
>>> - Would it just write a second version of the record to the Delta table?
>>>
>>> - How would the solution handle Deletes?
>>> - Special row, or marker for the deleted id?
>>>
>>> If we add a solution for all of these complexities, I'm afraid that we
>>> arrive at a solution which is very similar to the current one, especially if
>>> we consider that some readers need to do a single step read (reading a data
>>> file and emitting only the records which are still in the table).
>>>
>>> At best, this solution removes the need for the equality delete, but
>>> then single step readers of the old data files need to read all of the new
>>> records, to ensure that they don't emit already updated data. Which is
>>> worse than reading small equality delete files.
>>>
>>> Equality deletes could also be considered as a table whose rows should be
>>> removed from the table containing the old data.
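>>>
>>> In that framing, a read is an anti-join (a sketch only; "id" stands in
>>> for whatever equality-delete columns are in use):
>>>
>>>   SELECT o.* FROM old_data o
>>>   WHERE NOT EXISTS (SELECT 1 FROM eq_deletes e WHERE e.id = o.id);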
>>>
>>> Thanks, Peter
>>>
>>> On Tue, Nov 19, 2024, 17:46 Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> The proposal sounds similar to the Delta Lake CDC feature with CDC file
>>>> type [1] and CDC action [2].
>>>>
>>>> There was also the proposal I wrote a long time ago [3] to use a "cdc"
>>>> branch rather than 2 private tables, which was inspired by the Delta Lake
>>>> approach. The feedback was mixed at that time because, on one side, at least
>>>> the user does not need to have a 2-table setup and it is still considered
>>>> doing CDC against one Iceberg table, but on the other side the branching
>>>> construct was not widely supported by enterprise-level ETL and governance
>>>> features, and having 2 tables might just be cleaner after all. But we have
>>>> seen customers implement the cdc branch approach from the proposal, and it
>>>> was successful.
>>>>
>>>> Either way, in a Delta table based CDC approach, for the reader, we
>>>> could choose the view approach Russell described above, or develop a reader
>>>> that does essentially a broadcast join at scan level.
>>>>
>>>> Overall, I think we have a lot of options on the table to solve CDC in
>>>> Iceberg for both read and write.
>>>>
>>>> Given the row lineage feature is fundamentally in conflict with the
>>>> equality deletes, I would +1 for dropping equality delete support.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> [1]
>>>> https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-data-files
>>>> [2]
>>>> https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file
>>>> [3]
>>>> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0
>>>>
>>>>
>>>> On Tue, Nov 19, 2024 at 7:56 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> I'm strongly in favor of moving to the Delta + Base table approach
>>>>> discussed in the cookbook above. I wonder if we should codify that into
>>>>> something more standardized, but it seems to me to be a much better path
>>>>> forward. I'm not sure we need to support this at the spec level, but it
>>>>> would be nice if we could provide a table that was automatically broken
>>>>> into sub-tables and had well-defined operations on it.
>>>>>
>>>>> For example:
>>>>>
>>>>> FastUpdateTable:
>>>>>    Requires:
>>>>>      Primary Key Columns
>>>>>      Long Max Delta Size
>>>>>    Contains:
>>>>>        Private Iceberg Table: Delta
>>>>>        Private Iceberg Table: Base
>>>>>
>>>>>    On All Scans -
>>>>>        Return a view which joins delta and base on primary key; if
>>>>> Delta has a record for a given primary key, discard the base record
>>>>>
>>>>>   On All Writes -
>>>>>        Perform all writes against the delta table; only MERGE is
>>>>> allowed. Append is forbidden (no PK guarantees). Only position deletes are
>>>>> allowed.
>>>>>
>>>>>    On Delta Table Size > Max Delta Size -
>>>>>        Upsert DELTA into BASE
>>>>>        Clear upserted records from Delta
>>>>>
>>>>>
>>>>> If the Delta Table size is kept small I think this would be almost as
>>>>> performant as Equality deletes but still be compatible with row-lineage 
>>>>> and
>>>>> other indexing features.
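>>>>>
>>>>> The maintenance step could look roughly like the following Spark-style
>>>>> SQL (a sketch, assuming a single primary-key column "id"; UPDATE SET *
>>>>> and INSERT * are the Spark MERGE shorthands):
>>>>>
>>>>>   -- Fold the delta into the base, then clear the delta.
>>>>>   MERGE INTO base b
>>>>>   USING delta d
>>>>>   ON b.id = d.id
>>>>>   WHEN MATCHED THEN UPDATE SET *
>>>>>   WHEN NOT MATCHED THEN INSERT *;
>>>>>
>>>>>   DELETE FROM delta;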
>>>>>
>>>>>
>>>>> On Tue, Nov 19, 2024 at 7:12 AM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ajantha,
>>>>>>
>>>>>> I'm proposing exploring a view-based approach similar to the
>>>>>> changelog-mirror table pattern[1] (a sketch follows below) rather than
>>>>>> supporting delta writers for the Kafka Connect Iceberg sink.
>>>>>>
>>>>>> 1.
>>>>>> https://www.tabular.io/apache-iceberg-cookbook/data-engineering-cdc-table-mirroring/
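>>>>>>
>>>>>> A minimal sketch of that pattern, assuming an append-only changelog
>>>>>> table with a key "id", an operation column "op" and an ordering column
>>>>>> "op_ts" (all names hypothetical):
>>>>>>
>>>>>>   -- Mirror view: latest change per key, with deletes filtered out.
>>>>>>   CREATE VIEW mirror AS
>>>>>>   SELECT * FROM (
>>>>>>     SELECT *,
>>>>>>            row_number() OVER (PARTITION BY id ORDER BY op_ts DESC) AS rn
>>>>>>     FROM changelog
>>>>>>   ) t
>>>>>>   WHERE rn = 1 AND op <> 'D';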
>>>>>>
>>>>>> On Tue, Nov 19, 2024 at 7:38 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> I don’t think it’s a problem while an alternative is explored (the
>>>>>>> JDK itself does that pretty often).
>>>>>>> So it’s up to the community: of course I’m against removing it
>>>>>>> without solid alternative, but deprecation is fine imho.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Tue, Nov 19, 2024 at 12:19, Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> - ok for deprecate equality deletes
>>>>>>>>> - not ok to remove it
>>>>>>>>
>>>>>>>>
>>>>>>>> @JB: I don't think it is a good idea to use deprecated
>>>>>>>> functionality in new feature development.
>>>>>>>> Hence, my specific question was about the Kafka Connect upsert
>>>>>>>> operation.
>>>>>>>>
>>>>>>>> @Manu: I meant the delta writers for the Kafka Connect Iceberg sink
>>>>>>>> (which are in turn used for upserting the CDC records)
>>>>>>>> https://github.com/apache/iceberg/issues/10842
>>>>>>>>
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 19, 2024 at 3:08 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I second Anton's proposal to standardize on a view-based approach
>>>>>>>>> to handle CDC cases.
>>>>>>>>> Actually, it's already been explored in detail[1] by Jack before.
>>>>>>>>>
>>>>>>>>> [1] Improving Change Data Capture Use Case for Apache Iceberg
>>>>>>>>> <https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit?tab=t.0#heading=h.94xnx4qg3bnt>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 19, 2024 at 4:16 PM Jean-Baptiste Onofré <
>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> My proposal is the following (already expressed):
>>>>>>>>>> - ok for deprecate equality deletes
>>>>>>>>>> - not ok to remove it
>>>>>>>>>> - work on position delete improvements to address streaming use
>>>>>>>>>> cases. I think we should explore different approaches. Personally, I think a
>>>>>>>>>> possible approach would be to find a way to index into data files to avoid a
>>>>>>>>>> full scan to find row positions.
>>>>>>>>>>
>>>>>>>>>> My $0.01 :)
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 19, 2024 at 07:53, Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, What's the conclusion on this thread?
>>>>>>>>>>>
>>>>>>>>>>> Users are looking for Upsert (CDC) support for the OSS Iceberg
>>>>>>>>>>> Kafka Connect sink.
>>>>>>>>>>> We only support appends at the moment. Can we go ahead and
>>>>>>>>>>> implement upserts using equality deletes?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Nov 10, 2024 at 11:56 AM Vignesh <vignesh.v...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am reading about iceberg and am quite new to this.
>>>>>>>>>>>> This puffin would be an index from key to data file. Other use
>>>>>>>>>>>> cases of Puffin, such as statistics, are at a per-file level if I understand
>>>>>>>>>>>> correctly.
>>>>>>>>>>>> correctly.
>>>>>>>>>>>>
>>>>>>>>>>>> Where would the puffin about key->data file be stored? It is a
>>>>>>>>>>>> property of the entire table.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Vignesh.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Nov 9, 2024 at 2:17 AM Shani Elharrar
>>>>>>>>>>>> <sh...@upsolver.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> JB, this is what we do: we write Equality Deletes and
>>>>>>>>>>>>> periodically convert them to Positional Deletes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We could probably index the keys, maybe partially index using
>>>>>>>>>>>>> bloom filters; the best would be to put those bloom filters
>>>>>>>>>>>>> inside Puffin.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shani.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 9 Nov 2024, at 11:11, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Peter here, and I would say that it would be an
>>>>>>>>>>>>> issue for multi-engine support.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think, as I already mentioned with others, we should explore
>>>>>>>>>>>>> an alternative.
>>>>>>>>>>>>> As the main issue is the datafile scan in a streaming context,
>>>>>>>>>>>>> maybe we could find a way to "index"/correlate positions for
>>>>>>>>>>>>> positional deletes with limited scanning.
>>>>>>>>>>>>> limited scanning.
>>>>>>>>>>>>> I will think again about that :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Nov 9, 2024 at 6:48 AM Péter Váry <
>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Imran,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think it's a good idea to start creating multiple
>>>>>>>>>>>>>> types of Iceberg tables. Iceberg's main selling point is
>>>>>>>>>>>>>> compatibility between engines. If we don't have readers and
>>>>>>>>>>>>>> writers for all types of tables, then we remove compatibility
>>>>>>>>>>>>>> from the equation and engine-specific formats always win.
>>>>>>>>>>>>>> OTOH, if we write readers and writers for all types of tables,
>>>>>>>>>>>>>> then we are back to square one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Identifier fields are a table schema concept and used in many
>>>>>>>>>>>>>> cases during query planning and execution. This is why they are 
>>>>>>>>>>>>>> defined as
>>>>>>>>>>>>>> part of the SQL spec, and this is why Iceberg defines them as 
>>>>>>>>>>>>>> well. One use
>>>>>>>>>>>>>> case is where they can be used to merge deletes (independently 
>>>>>>>>>>>>>> of how they
>>>>>>>>>>>>>> are manifested) and subsequent inserts, into updates.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flink SQL doesn't allow creating tables with partition
>>>>>>>>>>>>>> transforms, so no new table could be created by Flink SQL
>>>>>>>>>>>>>> using transforms, but tables created by other engines could
>>>>>>>>>>>>>> still be used (both read and write). Also, you can create such
>>>>>>>>>>>>>> tables in Flink using the Java API.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Requiring partition columns to be part of the identifier
>>>>>>>>>>>>>> fields comes from the practical consideration that you want to
>>>>>>>>>>>>>> limit the scope of the equality deletes as much as possible.
>>>>>>>>>>>>>> Otherwise all of the equality deletes would have to be
>>>>>>>>>>>>>> table-global, and they would have to be read by every reader.
>>>>>>>>>>>>>> We could write those, we just decided that we don't want to
>>>>>>>>>>>>>> allow the user to do this, as it is in most cases a bad idea.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I hope this helps,
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Nov 8, 2024, 22:01 Imran Rashid
>>>>>>>>>>>>>> <iras...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not down in the weeds at all myself on implementation
>>>>>>>>>>>>>>> details, so forgive me if I'm wrong about the details here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can see all the viewpoints -- both that equality deletes
>>>>>>>>>>>>>>> enable some use cases, but also make others far more difficult.
>>>>>>>>>>>>>>> What surprised me the most is that Iceberg does not provide a 
>>>>>>>>>>>>>>> way to
>>>>>>>>>>>>>>> distinguish these two table "types".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At first, I thought the presence of an identifier-field (
>>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#identifier-field-ids)
>>>>>>>>>>>>>>> indicated that the table was a target for equality deletes.  
>>>>>>>>>>>>>>> But, then it
>>>>>>>>>>>>>>> turns out identifier-fields are also useful for changelog views 
>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>> without equality deletes -- IIUC, they show that a delete + 
>>>>>>>>>>>>>>> insert should
>>>>>>>>>>>>>>> actually be interpreted as an update in changelog view.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To be perfectly honest, I'm confused about all of these
>>>>>>>>>>>>>>> details -- from my read, the spec does not indicate this 
>>>>>>>>>>>>>>> relationship
>>>>>>>>>>>>>>> between identifier-fields and equality_ids in equality delete 
>>>>>>>>>>>>>>> files (
>>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#equality-delete-files),
>>>>>>>>>>>>>>> but I think that is the way Flink works.  Flink itself seems to 
>>>>>>>>>>>>>>> have even
>>>>>>>>>>>>>>> more limitations -- no partition transforms are allowed, and 
>>>>>>>>>>>>>>> all partition
>>>>>>>>>>>>>>> columns must be a subset of the identifier fields.  Is that 
>>>>>>>>>>>>>>> just a Flink
>>>>>>>>>>>>>>> limitation, or is that the intended behavior in the spec?  (Or 
>>>>>>>>>>>>>>> maybe
>>>>>>>>>>>>>>> user-error on my part?)  Those seem like very reasonable 
>>>>>>>>>>>>>>> limitations, from
>>>>>>>>>>>>>>> an implementation point-of-view.  But OTOH, as a user, this 
>>>>>>>>>>>>>>> seems to be
>>>>>>>>>>>>>>> directly contrary to some of the promises of Iceberg.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's easy to see if a table already has equality deletes in
>>>>>>>>>>>>>>> it, by looking at the metadata.  But is there any way to
>>>>>>>>>>>>>>> indicate that a table (or branch of a table) _must not_ have
>>>>>>>>>>>>>>> equality deletes added to it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If that were possible, it seems like we could support both
>>>>>>>>>>>>>>> use cases.  We could continue to optimize for the streaming 
>>>>>>>>>>>>>>> ingestion use
>>>>>>>>>>>>>>> cases using equality deletes.  But we could also build more 
>>>>>>>>>>>>>>> optimizations
>>>>>>>>>>>>>>> into the "non-streaming-ingestion" branches.  And we could 
>>>>>>>>>>>>>>> document the
>>>>>>>>>>>>>>> tradeoff so it is much clearer to end users.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To maintain compatibility, I suppose that the change would
>>>>>>>>>>>>>>> be that equality deletes continue to be allowed by default, but 
>>>>>>>>>>>>>>> we'd add a
>>>>>>>>>>>>>>> new field to indicate that for some tables (or branches of a 
>>>>>>>>>>>>>>> table),
>>>>>>>>>>>>>>> equality deletes would not be allowed.  And it would be an 
>>>>>>>>>>>>>>> error for an
>>>>>>>>>>>>>>> engine to make an update which added an equality delete to such 
>>>>>>>>>>>>>>> a table.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe that change would even be possible in V3.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And if all the performance improvements to equality deletes
>>>>>>>>>>>>>>> make this a moot point, we could drop the field in v4.  But it 
>>>>>>>>>>>>>>> seems like a
>>>>>>>>>>>>>>> mistake to both limit the non-streaming use-case AND have 
>>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>> limitations for the end-user in the meantime.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would happily be corrected about my understanding of all
>>>>>>>>>>>>>>> of the above.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks!
>>>>>>>>>>>>>>> Imran
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Nov 5, 2024 at 9:16 AM Bryan Keller <
>>>>>>>>>>>>>>> brya...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also feel we should keep equality deletes until we have
>>>>>>>>>>>>>>>> an alternative solution for streaming updates/deletes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Bryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 4, 2024, at 8:33 AM, Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, it seems like I'm a little late, so most of the
>>>>>>>>>>>>>>>> arguments have already been voiced.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree that we should not deprecate the equality deletes
>>>>>>>>>>>>>>>> until we have a replacement feature.
>>>>>>>>>>>>>>>> I think one of the big advantages of Iceberg is that it
>>>>>>>>>>>>>>>> supports batch processing and streaming ingestion too.
>>>>>>>>>>>>>>>> For streaming ingestion we need a way to update existing
>>>>>>>>>>>>>>>> data in a performant way, but restricting deletes to the
>>>>>>>>>>>>>>>> primary keys seems like enough from the streaming perspective.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Equality deletes allow a very wide range of applications,
>>>>>>>>>>>>>>>> which we might be able to narrow down a bit, but still keep 
>>>>>>>>>>>>>>>> useful. So if
>>>>>>>>>>>>>>>> we want to go down this road, we need to start collecting the 
>>>>>>>>>>>>>>>> requirements.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Shani Elharrar <sh...@upsolver.com.invalid> wrote
>>>>>>>>>>>>>>>> (on Fri, Nov 1, 2024 at 19:22):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I understand how it makes sense for batch jobs, but it
>>>>>>>>>>>>>>>>> damages streaming jobs. Using equality deletes works much
>>>>>>>>>>>>>>>>> better for streaming (which has strict SLAs for delays),
>>>>>>>>>>>>>>>>> and in order to decrease the performance penalty, systems
>>>>>>>>>>>>>>>>> can rewrite the equality deletes to positional deletes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Shani.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1 Nov 2024, at 20:06, Steven Wu <stevenz...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Fundamentally, it is very difficult to write position
>>>>>>>>>>>>>>>>> deletes with concurrent writers and conflicts for batch jobs 
>>>>>>>>>>>>>>>>> too, as the
>>>>>>>>>>>>>>>>> inverted index may become invalid/stale.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The position deletes are created during the write phase.
>>>>>>>>>>>>>>>>> But conflicts are only detected at the commit stage. I assume 
>>>>>>>>>>>>>>>>> the batch job
>>>>>>>>>>>>>>>>> should fail in this case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 10:57 AM Steven Wu <
>>>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Shani,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That is a good point. It is certainly a limitation for
>>>>>>>>>>>>>>>>>> the Flink job to track the inverted index internally (which 
>>>>>>>>>>>>>>>>>> is what I had
>>>>>>>>>>>>>>>>>> in mind). It can't be shared/synchronized with other Flink 
>>>>>>>>>>>>>>>>>> jobs or other
>>>>>>>>>>>>>>>>>> engines writing to the same table.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar
>>>>>>>>>>>>>>>>>> <sh...@upsolver.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Even if Flink can create this state, it would have to be
>>>>>>>>>>>>>>>>>>> maintained against the Iceberg table, we wouldn't like 
>>>>>>>>>>>>>>>>>>> duplicates (keys) if
>>>>>>>>>>>>>>>>>>> other systems / users update the table (e.g manual insert / 
>>>>>>>>>>>>>>>>>>> updates using
>>>>>>>>>>>>>>>>>>> DML).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Shani.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 1 Nov 2024, at 18:32, Steven Wu <stevenz...@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > Add support for inverted indexes to reduce the cost of
>>>>>>>>>>>>>>>>>>> position lookup. This is fairly tricky to implement for 
>>>>>>>>>>>>>>>>>>> streaming use cases
>>>>>>>>>>>>>>>>>>> without an external system.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Anton, that is also what I was saying earlier. In Flink,
>>>>>>>>>>>>>>>>>>> the inverted index of (key, committed data files) can be 
>>>>>>>>>>>>>>>>>>> tracked in Flink
>>>>>>>>>>>>>>>>>>> state.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>>> aokolnyc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was a bit skeptical when we were adding equality
>>>>>>>>>>>>>>>>>>>> deletes, but nothing beats their performance during 
>>>>>>>>>>>>>>>>>>>> writes. We have to find
>>>>>>>>>>>>>>>>>>>> an alternative before deprecating.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We are doing a lot of work to improve streaming, like
>>>>>>>>>>>>>>>>>>>> reducing the cost of commits, enabling a large 
>>>>>>>>>>>>>>>>>>>> (potentially infinite)
>>>>>>>>>>>>>>>>>>>> number of snapshots, changelog reads, and so on. It is a 
>>>>>>>>>>>>>>>>>>>> project goal to
>>>>>>>>>>>>>>>>>>>> excel in streaming.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was going to focus on equality deletes after
>>>>>>>>>>>>>>>>>>>> completing the DV work. I believe we have these options:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Revisit the existing design of equality deletes (e.g.
>>>>>>>>>>>>>>>>>>>> add more restrictions, improve compaction, offer new 
>>>>>>>>>>>>>>>>>>>> writers).
>>>>>>>>>>>>>>>>>>>> - Standardize on the view-based approach [1] to handle
>>>>>>>>>>>>>>>>>>>> streaming upserts and CDC use cases, potentially making 
>>>>>>>>>>>>>>>>>>>> this part of the
>>>>>>>>>>>>>>>>>>>> spec.
>>>>>>>>>>>>>>>>>>>> - Add support for inverted indexes to reduce the cost
>>>>>>>>>>>>>>>>>>>> of position lookup. This is fairly tricky to implement for 
>>>>>>>>>>>>>>>>>>>> streaming use
>>>>>>>>>>>>>>>>>>>> cases without an external system. Our runtime filtering in 
>>>>>>>>>>>>>>>>>>>> Spark today is
>>>>>>>>>>>>>>>>>>>> equivalent to looking up positions in an inverted index 
>>>>>>>>>>>>>>>>>>>> represented by
>>>>>>>>>>>>>>>>>>>> another Iceberg table. That may still not be enough for 
>>>>>>>>>>>>>>>>>>>> some streaming use
>>>>>>>>>>>>>>>>>>>> cases.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 21:31, Micah Kornfield <
>>>>>>>>>>>>>>>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I agree that equality deletes have their place in
>>>>>>>>>>>>>>>>>>>>> streaming.  I think the ultimate decision here is how
>>>>>>>>>>>>>>>>>>>>> opinionated Iceberg wants to be about its use-cases.  If it
>>>>>>>>>>>>>>>>>>>>> really wants to stick to its origins of "slow moving data",
>>>>>>>>>>>>>>>>>>>>> then removing equality deletes would be in line with this.  I
>>>>>>>>>>>>>>>>>>>>> think the other high-level question is how much we allow for
>>>>>>>>>>>>>>>>>>>>> partially compatible features (the row lineage feature was
>>>>>>>>>>>>>>>>>>>>> explicitly approved excluding equality deletes, and people
>>>>>>>>>>>>>>>>>>>>> seemed OK with it at the time; if all features need to work
>>>>>>>>>>>>>>>>>>>>> together, then maybe we need to rethink the design here so it
>>>>>>>>>>>>>>>>>>>>> can be forward compatible with equality deletes).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think one issue with equality deletes as stated in
>>>>>>>>>>>>>>>>>>>>> the spec is that they are overly broad.  I'd be interested if
>>>>>>>>>>>>>>>>>>>>> people have any use cases that differ, but I think one way of
>>>>>>>>>>>>>>>>>>>>> narrowing the specification's scope on equality deletes (and
>>>>>>>>>>>>>>>>>>>>> probably a necessary building block for building something
>>>>>>>>>>>>>>>>>>>>> better) is to focus on upsert/streaming deletes.  Two
>>>>>>>>>>>>>>>>>>>>> proposals in this regard are:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1.  Require that equality deletes can only correspond
>>>>>>>>>>>>>>>>>>>>> to unique identifiers for the table.
>>>>>>>>>>>>>>>>>>>>> 2.  Consider requiring that, for equality deletes on
>>>>>>>>>>>>>>>>>>>>> partitioned tables, the primary key must contain a partition
>>>>>>>>>>>>>>>>>>>>> column (I believe Flink at least already does this).  It is
>>>>>>>>>>>>>>>>>>>>> less clear to me that this would meet all existing use-cases.
>>>>>>>>>>>>>>>>>>>>> But having this would allow for better incremental
>>>>>>>>>>>>>>>>>>>>> data structures, which could then be partition-based.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Narrowing the scope to unique identifiers would allow
>>>>>>>>>>>>>>>>>>>>> for the further building blocks already mentioned, like a
>>>>>>>>>>>>>>>>>>>>> secondary index (possibly via an LSM tree), that would allow
>>>>>>>>>>>>>>>>>>>>> for better performance overall.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I generally agree with the sentiment that we shouldn't
>>>>>>>>>>>>>>>>>>>>> deprecate them until there is a viable replacement.  With 
>>>>>>>>>>>>>>>>>>>>> all due respect
>>>>>>>>>>>>>>>>>>>>> to my employer, let's not fall into the Google trap [1] :)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] https://goomics.net/50/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo <
>>>>>>>>>>>>>>>>>>>>> alex...@starburstdata.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Just to throw my 2 cents in, I agree with Steven and
>>>>>>>>>>>>>>>>>>>>>> others that we do need some kind of replacement before 
>>>>>>>>>>>>>>>>>>>>>> deprecating equality
>>>>>>>>>>>>>>>>>>>>>> deletes.
>>>>>>>>>>>>>>>>>>>>>> They certainly have their problems, and do
>>>>>>>>>>>>>>>>>>>>>> significantly increase complexity as they are now, but 
>>>>>>>>>>>>>>>>>>>>>> the writing of
>>>>>>>>>>>>>>>>>>>>>> position deletes is too expensive for certain pipelines.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We've been investigating using equality deletes for
>>>>>>>>>>>>>>>>>>>>>> some of our workloads at Starburst, the key advantage we 
>>>>>>>>>>>>>>>>>>>>>> were hoping to
>>>>>>>>>>>>>>>>>>>>>> leverage is cheap, effectively random access lookup 
>>>>>>>>>>>>>>>>>>>>>> deletes.
>>>>>>>>>>>>>>>>>>>>>> Say you have a UUID column that's unique in a table
>>>>>>>>>>>>>>>>>>>>>> and want to delete a row by UUID. With position deletes 
>>>>>>>>>>>>>>>>>>>>>> each delete is
>>>>>>>>>>>>>>>>>>>>>> expensive without an index on that UUID.
>>>>>>>>>>>>>>>>>>>>>> With equality deletes each delete is cheap, while
>>>>>>>>>>>>>>>>>>>>>> reads/compaction are expensive; but when updates are frequent
>>>>>>>>>>>>>>>>>>>>>> and reads are sporadic, that's a reasonable tradeoff.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Pretty much what Jason and Steven have already said.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Maybe there are some incremental improvements on
>>>>>>>>>>>>>>>>>>>>>> equality deletes or tips from similar systems that might 
>>>>>>>>>>>>>>>>>>>>>> alleviate some of
>>>>>>>>>>>>>>>>>>>>>> their problems?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - Alex Jo
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We probably all agree with the downside of equality
>>>>>>>>>>>>>>>>>>>>>>> deletes: it postpones all the work on the read path.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In theory, we can implement position deletes only in
>>>>>>>>>>>>>>>>>>>>>>> the Flink streaming writer. It would require the 
>>>>>>>>>>>>>>>>>>>>>>> tracking of last committed
>>>>>>>>>>>>>>>>>>>>>>> data files per key, which can be stored in Flink state 
>>>>>>>>>>>>>>>>>>>>>>> (checkpointed). This
>>>>>>>>>>>>>>>>>>>>>>> is obviously quite expensive/challenging, but possible.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'd like to echo one benefit of equality deletes that
>>>>>>>>>>>>>>>>>>>>>>> Russell called out in the original email. Equality deletes
>>>>>>>>>>>>>>>>>>>>>>> would never have conflicts; that is important for streaming
>>>>>>>>>>>>>>>>>>>>>>> writers (Flink, Kafka Connect, ...) that commit frequently
>>>>>>>>>>>>>>>>>>>>>>> (minutes or less). Assume Flink can write position deletes
>>>>>>>>>>>>>>>>>>>>>>> only and commit every 2 minutes. The long-running nature of
>>>>>>>>>>>>>>>>>>>>>>> streaming jobs can cause frequent commit conflicts with
>>>>>>>>>>>>>>>>>>>>>>> background delete compaction jobs.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Overall, the streaming upsert write is not a well
>>>>>>>>>>>>>>>>>>>>>>> solved problem in Iceberg. This probably affects all 
>>>>>>>>>>>>>>>>>>>>>>> streaming engines
>>>>>>>>>>>>>>>>>>>>>>> (Flink, Kafka connect, Spark streaming, ...). We need 
>>>>>>>>>>>>>>>>>>>>>>> to come up with some
>>>>>>>>>>>>>>>>>>>>>>> better alternatives before we can deprecate equality 
>>>>>>>>>>>>>>>>>>>>>>> deletes.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 8:38 AM Russell Spitzer <
>>>>>>>>>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For users of Equality Deletes, what are the key
>>>>>>>>>>>>>>>>>>>>>>>> benefits to Equality Deletes that you would like to 
>>>>>>>>>>>>>>>>>>>>>>>> preserve and could you
>>>>>>>>>>>>>>>>>>>>>>>> please share some concrete examples of the queries you 
>>>>>>>>>>>>>>>>>>>>>>>> want to run (and the
>>>>>>>>>>>>>>>>>>>>>>>> schemas and data sizes you would like to run them 
>>>>>>>>>>>>>>>>>>>>>>>> against) and the
>>>>>>>>>>>>>>>>>>>>>>>> latencies that would be acceptable?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 10:05 AM Jason Fine
>>>>>>>>>>>>>>>>>>>>>>>> <ja...@upsolver.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Representing Upsolver here, we also make use of
>>>>>>>>>>>>>>>>>>>>>>>>> Equality Deletes to deliver high frequency low 
>>>>>>>>>>>>>>>>>>>>>>>>> latency updates to our
>>>>>>>>>>>>>>>>>>>>>>>>> clients at scale. We have customers using them at 
>>>>>>>>>>>>>>>>>>>>>>>>> scale and demonstrating
>>>>>>>>>>>>>>>>>>>>>>>>> the need and viability. We automate the process of 
>>>>>>>>>>>>>>>>>>>>>>>>> converting them into
>>>>>>>>>>>>>>>>>>>>>>>>> positional deletes (or fully applying them) for more 
>>>>>>>>>>>>>>>>>>>>>>>>> efficient engine
>>>>>>>>>>>>>>>>>>>>>>>>> queries in the background giving our users both low 
>>>>>>>>>>>>>>>>>>>>>>>>> latency and good query
>>>>>>>>>>>>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Equality Deletes were added since there isn't a
>>>>>>>>>>>>>>>>>>>>>>>>> good way to solve frequent updates otherwise. It 
>>>>>>>>>>>>>>>>>>>>>>>>> would require some sort of
>>>>>>>>>>>>>>>>>>>>>>>>> index keeping track of every record in the table (by 
>>>>>>>>>>>>>>>>>>>>>>>>> a predetermined PK)
>>>>>>>>>>>>>>>>>>>>>>>>> and maintaining such an index is a huge task that 
>>>>>>>>>>>>>>>>>>>>>>>>> every tool interested in
>>>>>>>>>>>>>>>>>>>>>>>>> this would need to re-implement. It also becomes a 
>>>>>>>>>>>>>>>>>>>>>>>>> bottleneck limiting
>>>>>>>>>>>>>>>>>>>>>>>>> table sizes.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I don't think they should be removed without
>>>>>>>>>>>>>>>>>>>>>>>>> providing an alternative. Positional Deletes have a 
>>>>>>>>>>>>>>>>>>>>>>>>> different performance
>>>>>>>>>>>>>>>>>>>>>>>>> profile inherently, requiring more upfront work 
>>>>>>>>>>>>>>>>>>>>>>>>> proportional to the table
>>>>>>>>>>>>>>>>>>>>>>>>> size.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 31, 2024 at 2:45 PM Jean-Baptiste
>>>>>>>>>>>>>>>>>>>>>>>>> Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Russell
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the nice writeup and the proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with your analysis, and I have the same
>>>>>>>>>>>>>>>>>>>>>>>>>> feeling. However, I think there are more engines than
>>>>>>>>>>>>>>>>>>>>>>>>>> Flink that write equality delete files. So, I agree to
>>>>>>>>>>>>>>>>>>>>>>>>>> deprecate in V3, but maybe be more "flexible" about removal
>>>>>>>>>>>>>>>>>>>>>>>>>> in V4 in order to give engines time to update.
>>>>>>>>>>>>>>>>>>>>>>>>>> I think that by deprecating equality deletes, we are
>>>>>>>>>>>>>>>>>>>>>>>>>> clearly focusing on read performance and "consistency"
>>>>>>>>>>>>>>>>>>>>>>>>>> (more than write). It's not necessarily a bad thing, but
>>>>>>>>>>>>>>>>>>>>>>>>>> the streaming platform and data ingestion platforms will
>>>>>>>>>>>>>>>>>>>>>>>>>> probably be concerned about that (by using positional
>>>>>>>>>>>>>>>>>>>>>>>>>> deletes, they will have to scan/read all datafiles to find
>>>>>>>>>>>>>>>>>>>>>>>>>> the position, which is painful).
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> So, to summarize:
>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Agree to deprecate equality deletes, but -1 to
>>>>>>>>>>>>>>>>>>>>>>>>>> commit any target
>>>>>>>>>>>>>>>>>>>>>>>>>> for deletion before having a clear path for
>>>>>>>>>>>>>>>>>>>>>>>>>> streaming platforms
>>>>>>>>>>>>>>>>>>>>>>>>>> (Flink, Beam, ...)
>>>>>>>>>>>>>>>>>>>>>>>>>> 2. In the meantime (during the deprecation
>>>>>>>>>>>>>>>>>>>>>>>>>> period), I propose to
>>>>>>>>>>>>>>>>>>>>>>>>>> explore possible improvements for streaming
>>>>>>>>>>>>>>>>>>>>>>>>>> platforms (maybe finding a
>>>>>>>>>>>>>>>>>>>>>>>>>> way to avoid full data files scan, ...)
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks !
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 30, 2024 at 10:06 PM Russell Spitzer
>>>>>>>>>>>>>>>>>>>>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Background:
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > 1) Position Deletes
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Writers determine what rows are deleted and
>>>>>>>>>>>>>>>>>>>>>>>>>> mark them in a 1 for 1 representation. With delete 
>>>>>>>>>>>>>>>>>>>>>>>>>> vectors this means every
>>>>>>>>>>>>>>>>>>>>>>>>>> data file has at most 1 delete vector that it is 
>>>>>>>>>>>>>>>>>>>>>>>>>> read in conjunction with
>>>>>>>>>>>>>>>>>>>>>>>>>> to excise deleted rows. Reader overhead is more or 
>>>>>>>>>>>>>>>>>>>>>>>>>> less constant and is
>>>>>>>>>>>>>>>>>>>>>>>>>> very predictable.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > The main cost of this mode is that deletes must
>>>>>>>>>>>>>>>>>>>>>>>>>> be determined at write time which is expensive and 
>>>>>>>>>>>>>>>>>>>>>>>>>> can be more difficult
>>>>>>>>>>>>>>>>>>>>>>>>>> for conflict resolution
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > 2) Equality Deletes
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Writers write out reference to what values are
>>>>>>>>>>>>>>>>>>>>>>>>>> deleted (in a partition or globally). There can be 
>>>>>>>>>>>>>>>>>>>>>>>>>> an unlimited number of
>>>>>>>>>>>>>>>>>>>>>>>>>> equality deletes and they all must be checked for 
>>>>>>>>>>>>>>>>>>>>>>>>>> every data file that is
>>>>>>>>>>>>>>>>>>>>>>>>>> read. The cost of determining deleted rows is 
>>>>>>>>>>>>>>>>>>>>>>>>>> essentially given to the
>>>>>>>>>>>>>>>>>>>>>>>>>> reader.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Conflicts almost never happen since data files
>>>>>>>>>>>>>>>>>>>>>>>>>> are not actually changed and there is almost no cost 
>>>>>>>>>>>>>>>>>>>>>>>>>> to the writer to
>>>>>>>>>>>>>>>>>>>>>>>>>> generate these. Almost all costs related to equality 
>>>>>>>>>>>>>>>>>>>>>>>>>> deletes are passed on
>>>>>>>>>>>>>>>>>>>>>>>>>> to the reader.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Proposal:
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Equality deletes are, in my opinion,
>>>>>>>>>>>>>>>>>>>>>>>>>> unsustainable and we should work on deprecating and 
>>>>>>>>>>>>>>>>>>>>>>>>>> removing them from the
>>>>>>>>>>>>>>>>>>>>>>>>>> specification. At this time, I know of only one 
>>>>>>>>>>>>>>>>>>>>>>>>>> engine (Apache Flink) which
>>>>>>>>>>>>>>>>>>>>>>>>>> produces these deletes but almost all engines have 
>>>>>>>>>>>>>>>>>>>>>>>>>> implementations to read
>>>>>>>>>>>>>>>>>>>>>>>>>> them. The cost of implementing equality deletes on 
>>>>>>>>>>>>>>>>>>>>>>>>>> the read path is
>>>>>>>>>>>>>>>>>>>>>>>>>> difficult and unpredictable in terms of memory usage 
>>>>>>>>>>>>>>>>>>>>>>>>>> and compute
>>>>>>>>>>>>>>>>>>>>>>>>>> complexity. We've had suggestions of implementing rocksdb in order to handle
>>>>>>>>>>>>>>>>>>>>>>>>>> rocksdb inorder to handle
>>>>>>>>>>>>>>>>>>>>>>>>>> ever growing sets of equality deletes which in my 
>>>>>>>>>>>>>>>>>>>>>>>>>> opinion shows that we are
>>>>>>>>>>>>>>>>>>>>>>>>>> going down the wrong path.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Outside of performance, Equality deletes are
>>>>>>>>>>>>>>>>>>>>>>>>>> also difficult to use in conjunction with many other 
>>>>>>>>>>>>>>>>>>>>>>>>>> features. For example,
>>>>>>>>>>>>>>>>>>>>>>>>>> any features requiring CDC or Row lineage are 
>>>>>>>>>>>>>>>>>>>>>>>>>> basically impossible when
>>>>>>>>>>>>>>>>>>>>>>>>>> equality deletes are in use. When Equality deletes 
>>>>>>>>>>>>>>>>>>>>>>>>>> are present, the state
>>>>>>>>>>>>>>>>>>>>>>>>>> of the table can only be determined with a full scan 
>>>>>>>>>>>>>>>>>>>>>>>>>> making it difficult to
>>>>>>>>>>>>>>>>>>>>>>>>>> update differential structures. This means 
>>>>>>>>>>>>>>>>>>>>>>>>>> materialized views or indexes
>>>>>>>>>>>>>>>>>>>>>>>>>> need to essentially be fully rebuilt whenever an 
>>>>>>>>>>>>>>>>>>>>>>>>>> equality delete is added
>>>>>>>>>>>>>>>>>>>>>>>>>> to the table.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Equality deletes essentially remove complexity
>>>>>>>>>>>>>>>>>>>>>>>>>> from the write side but then add what I believe is 
>>>>>>>>>>>>>>>>>>>>>>>>>> an unacceptable level of
>>>>>>>>>>>>>>>>>>>>>>>>>> complexity to the read side.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Because of this I suggest we deprecate Equality
>>>>>>>>>>>>>>>>>>>>>>>>>> Deletes in V3 and slate them for full removal from 
>>>>>>>>>>>>>>>>>>>>>>>>>> the Iceberg Spec in V4.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > I know this is a big change and compatibility
>>>>>>>>>>>>>>>>>>>>>>>>>> breakage so I would like to introduce this idea to 
>>>>>>>>>>>>>>>>>>>>>>>>>> the community and
>>>>>>>>>>>>>>>>>>>>>>>>>> solicit feedback from all stakeholders. I am very 
>>>>>>>>>>>>>>>>>>>>>>>>>> flexible on this issue
>>>>>>>>>>>>>>>>>>>>>>>>>> and would like to hear the best issues both for and 
>>>>>>>>>>>>>>>>>>>>>>>>>> against removal of
>>>>>>>>>>>>>>>>>>>>>>>>>> Equality Deletes.
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Thanks everyone for your time,
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> > Russ Spitzer
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> *Jason Fine*
>>>>>>>>>>>>>>>>>>>>>>>>> Chief Software Architect
>>>>>>>>>>>>>>>>>>>>>>>>> ja...@upsolver.com  | www.upsolver.com
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
