Re: [DISCUSS] Row Lineage Proposal

2024-09-16 Thread Russell Spitzer
One for each Table Version? Maybe worth thinking about going forwards. We had a little discussion about this at the community sync up last weds and the general consensus is we just keep doing things the way we are doing them until it becomes too unwieldy, then figure out a new solution. Feel free to st

Re: [DISCUSS] Row Lineage Proposal

2024-09-13 Thread Manu Zhang
Thanks Russell. Not a question on the proposal itself, but I find it a bit hard to follow and maintain all the three specs in one place. We are also publishing an unfinalized spec to the website. Would it be better to maintain the spec in a "copy-on-write" style, i.e. each spec having its own format file

Re: [DISCUSS] Row Lineage Proposal

2024-09-13 Thread Russell Spitzer
Pull Request Available, please focus any remaining comments there and we can wrap this one up https://github.com/apache/iceberg/pull/11130 On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com wrote: > +1 for making row lineage and equality deletes mutually exclusive. > > The idea behind equality d

Re: [DISCUSS] Row Lineage Proposal

2024-08-29 Thread rdb...@gmail.com
+1 for making row lineage and equality deletes mutually exclusive. The idea behind equality deletes is to avoid needing to read existing data in order to delete records. That doesn't fit with row lineage because the purpose of lineage is to be able to identify when a row changes by maintaining an

Re: [DISCUSS] Row Lineage Proposal

2024-08-28 Thread Anton Okolnychyi
I went through the proposal and left comments as well. Thanks for working on it, Russell! I don't see a good solution to how row lineage can work with equality deletes. If so, I would be in favor of not allowing equality deletes at all if row lineage is enabled as opposed to treating all added dat

Re: [DISCUSS] Row Lineage Proposal

2024-08-28 Thread Ryan Blue
Sounds good to me. Thanks for pushing this forward, Russell! On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer wrote: > I think folks have had a lot of good comments and since there haven't been > a lot of strong opinions I'm going to try to take what I think are the > least interesting options an

Re: [DISCUSS] Row Lineage Proposal

2024-08-27 Thread Russell Spitzer
I think folks have had a lot of good comments and since there haven't been a lot of strong opinions I'm going to try to take what I think are the least interesting options and move them into the "discarded section". Please continue to comment and let's please make sure any things that folk think ar

Re: [DISCUSS] Row Lineage Proposal

2024-08-19 Thread Ryan Blue
The situation in which you would use equality deletes is when you do not want to read the existing table data. That seems at odds with a feature like row-level tracking where you want to keep track. To me, it would be a reasonable solution to just say that equality deletes can't be used in tables w

Re: [DISCUSS] Row Lineage Proposal

2024-08-19 Thread Russell Spitzer
As far as I know Flink is actually the only engine we have at the moment that can produce equality deletes, and only equality deletes have this specific problem. Since an equality delete can be written without actually knowing whether rows are being updated or not, it is always ambiguous as to whethe

Re: [DISCUSS] Row Lineage Proposal

2024-08-16 Thread Péter Váry
Hi Russell, As discussed offline, this would be very hard to implement with the current Flink CDC write strategies. I think this is true for every streaming writer. For tracking the previous version of the row, the streaming writer would need to scan the table. It needs to be done for every reco