> It would be useful to describe the types of concurrent operations that
> would be supported (i.e., failed snapshotting could easily be recovered,
> vs. the whole operation needing to be re-executed) vs. those that wouldn't.
> Solving for unlimited concurrency cases may create way more complexity than
> is necessary.

I'd like to restate my comment a little bit. We need unique keys to make things work. They can be synthetic or not, but they should not have any retrievable Iceberg-related data in them. The main thing I'm talking about is how you target a deletion across time. If you have a file A and you want to delete record X in A, you define the delete as A.X. At the same time, another process may be compacting A into A'. In so doing, the position of A.X in A' becomes something other than X. At this point, the deletion needs to be rerun against A' so that we can ensure the deletion is propagated forward. If the only thing you have is A.X, you need some way of getting to the same record in A'. You should be able to take the delta file that lists the delete of A.X and apply it directly to A' without having to also consult A. If you didn't need to solve this problem, you could simply use A.X, as opposed to the key of A.X, in your delta files.

> Synthetic seems relative. If the synthetic key is client-supplied, in what
> way is it relevant to Iceberg whether it is synthetic vs. natural? By
> calling it synthetic within Iceberg there is a strong implication that it
> is the implementation that generates it (the filename/position key suggests
> that). If it's the client that supplies it, it _may_ be synthetic (from the
> point of view of the overall data model; i.e. a customer key in a database
> vs. a customer ID that shows up on a bill) but from Iceberg's point of view
> that doesn't matter. Only the uniqueness constraint does.

I agree with the main statement here: the only real requirement is that keys be unique across all existing snapshots. There could be two generators: one that uses Iceberg-internal behavior to generate keys and one that is user-definable.
While there could be a third, which uses an existing field (or set of fields) to define the key, I think we should probably avoid implementing it, as it has a whole other set of problems that are best left outside of Iceberg's area of concern.
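
To make the compaction case concrete, here is a rough Python sketch of what I mean. It is purely illustrative and not Iceberg code: the record layout, `compact`, `apply_position_deletes`, `apply_key_deletes` and both key generators are hypothetical names I made up for this example, and "compaction" is modeled as nothing more than a reordering rewrite of A into A'.

```python
import itertools
import uuid
from typing import Callable, List, Set, Tuple

Record = Tuple[str, str]  # (opaque unique key, payload); hypothetical layout


def internal_key_generator() -> str:
    """Hypothetical built-in generator: opaque, carries no file/position data."""
    return uuid.uuid4().hex


def make_user_key_generator(prefix: str) -> Callable[[], str]:
    """Hypothetical user-definable generator; the caller guarantees uniqueness."""
    counter = itertools.count(1)
    return lambda: f"{prefix}-{next(counter)}"


def write_file(payloads: List[str], key_gen: Callable[[], str]) -> List[Record]:
    """Assign a key to every record at write time."""
    return [(key_gen(), payload) for payload in payloads]


def compact(records: List[Record]) -> List[Record]:
    """Stand-in for compaction: rewrite A into A', changing record positions."""
    return sorted(records, key=lambda r: r[1])


def apply_position_deletes(records: List[Record], positions: Set[int]) -> List[Record]:
    """Delete by position; only valid against the file the positions came from."""
    return [r for i, r in enumerate(records) if i not in positions]


def apply_key_deletes(records: List[Record], keys: Set[str]) -> List[Record]:
    """Delete by key; valid against any rewrite that preserves the keys."""
    return [r for r in records if r[0] not in keys]


# File A, keyed by the user-definable generator so the output is deterministic.
file_a = write_file(["zebra", "apple", "mango"], make_user_key_generator("k"))

# Target the record at position 1 of A ("apple"), expressed both ways.
position_delete = {1}
key_delete = {file_a[1][0]}

# A concurrent process compacts A into A' before the delete is applied.
file_a_prime = compact(file_a)

# Replaying the position delete against A' removes the wrong record ("mango").
stale_result = apply_position_deletes(file_a_prime, position_delete)

# Replaying the key delete against A' removes "apple" without consulting A.
key_result = apply_key_deletes(file_a_prime, key_delete)

print([payload for _, payload in stale_result])
print([payload for _, payload in key_result])
```

The position-based delete only means something relative to A, so after the rewrite it either has to be translated through A or it hits the wrong row. The key-based delete can be applied directly to A'. Either generator works here, since the only property the delete path relies on is that keys are unique.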