Synthetic vs. natural keys generated a lot of discussion internally when we were putting the proposal together :)
There is a lot of detail and a lot of subtlety here, but a few things may help explain our thinking:

Scale - Iceberg is for remote storage at scale, not local disk. This means mainly operating on large batches of data, not one row at a time. One row at a time would generate an RPC call for each row, and many small files in HDFS/object stores.

Concurrency - while there are multiple writers, they are usually externally coordinated (think Spark driver coordinating Spark executors). It's very unlikely we would need to support completely independent writers; at least, we couldn't come up with a real-world use case where that would be needed.

Uniqueness - enforcing uniqueness at scale is not feasible (provably so). If we cannot enforce it, we shouldn't assume it. This is fine, since we can have consistent behaviour even in the presence of duplicates (the document explains this in some detail).

Natural keys - the key is called 'natural' because it's a key that's naturally unique in the dataset, for example the combination (timestamp, eventId). Because it's naturally unique, it's the most selective index for the dataset, so it's what people would naturally partition/bucket/sort by. So, almost by definition, the natural key is the composite key made up of the partition/bucket/sort keys. It wouldn't make logical sense for users to define a 'natural key' separate from that in their table definition. This key is usually naturally unique in the dataset, but even if it isn't, the behaviour is consistent (please convince yourself of this from the document).

Also, at scale, it's really only feasible to query and update/upsert on the partition/bucket/sort key; any other access is likely a full scan of terabytes of data on remote storage. So we mostly only care about query/upsert on the natural key predicate. Update/upsert on a non-natural key would be possible, and consistent, but it's intrinsically very expensive (a full scan), so we shouldn't worry too much about optimizing that.
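To make the natural-key behaviour concrete, here's a rough sketch in plain Java (all names are made up for illustration; this isn't Iceberg code or a proposed API). It assumes an upsert is encoded as "delete every row matching the natural key, then append the new row", which is why the result stays consistent even if the key turns out to be duplicated:

import java.util.*;
import java.util.stream.*;

public class NaturalKeyUpsertSketch {
    // Illustrative only: the "natural key" here is the composite of the
    // columns the table is partitioned/bucketed/sorted by.
    record Row(long timestamp, String eventId, String payload) {}

    static String naturalKey(Row r) {
        return r.timestamp() + "/" + r.eventId();
    }

    // An upsert encoded as "delete every row matching the natural key,
    // then append the new row". All duplicates of the key are removed,
    // so the outcome is the same whether or not the key was unique.
    static List<Row> upsert(List<Row> base, Row newRow) {
        String key = naturalKey(newRow);
        List<Row> result = base.stream()
                .filter(r -> !naturalKey(r).equals(key))
                .collect(Collectors.toCollection(ArrayList::new));
        result.add(newRow);
        return result;
    }

    public static void main(String[] args) {
        List<Row> base = List.of(
                new Row(100L, "e1", "v1"),
                new Row(100L, "e1", "v1-duplicate"), // duplicated natural key
                new Row(200L, "e2", "v2"));
        upsert(base, new Row(100L, "e1", "v3")).forEach(System.out::println);
        // Exactly one row remains for key (100, e1), despite the duplicate.
    }
}

In this sketch the duplicate collapses to a single row the first time its key is upserted, so readers see the same result whether or not uniqueness was ever enforced.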
> On 21 May 2019, at 19:26, Erik Wright <erik.wri...@shopify.com.invalid> wrote:
>
>> On Tue, May 21, 2019 at 2:01 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>> I think we just need to have further discussion about keys. Ryan said:
>>
>>> 3. Synthetic keys should be based on filename and position
>>
>> But I'm not clear there is consensus around that. I'm also not sure whether he means lossless inclusion, simply derived-from, or something else. My thinking before is you must have synthetic keys in many cases since solving concurrency becomes challenging otherwise.
>
> It would be useful to describe the types of concurrent operations that would be supported (i.e., failed snapshotting could easily be recovered, vs. the whole operation needing to be re-executed) vs. those that wouldn't. Solving for unlimited concurrency cases may create way more complexity than is necessary.
>
>> Erik said:
>>> C might not be partitioned or sorted by the natural key, either, which means that finding the synthetic key can be expensive.
>>
>> But I'm not sure why this needs to be true. If we give control over the creation of the synthetic key, wouldn't that resolve this issue?
>
> Synthetic seems relative. If the synthetic key is client-supplied, in what way is it relevant to Iceberg whether it is synthetic vs. natural? By calling it synthetic within Iceberg there is a strong implication that it is the implementation that generates it (the filename/position key suggests that). If it's the client that supplies it, it _may_ be synthetic (from the point of view of the overall data model; i.e. a customer key in a database vs. a customer ID that shows up on a bill), but from Iceberg's point of view that doesn't matter. Only the unicity constraint does.
>
> On the other hand, everything _but_ the thing about filename/position can probably be resolved. One solution could be to allow either user-specified _or_ filename/position-based keys in delta files. If you use the latter, there are constraints on the types of operations you can do. If you use the former, there are constraints on the types of concurrency that are supported (I'm not sure about that, really, just going off of your comment above).
>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>> On Tue, May 21, 2019 at 7:54 AM Erik Wright <erik.wri...@shopify.com.invalid> wrote:
>>>> On Thu, May 16, 2019 at 4:13 PM Ryan Blue <rb...@netflix.com> wrote:
>>>> Replies inline.
>>>>
>>>>> On Thu, May 16, 2019 at 10:07 AM Erik Wright <erik.wri...@shopify.com> wrote:
>>>>> I would be happy to participate. Iceberg with merge-on-read capabilities is a technology choice that my team is actively considering. It appears that our scenario differs meaningfully from the one that Anton and Miguel are considering. It would be great to take the time to compare the two and see if there is a single implementation that can meet the needs of each scenario.
>>>>
>>>> Can you be more specific about where the use cases differ meaningfully? I think that if we agree that operations on natural keys can be implemented using synthetic keys to encode deletes (#2), then everyone is aligned on the core parts of a design. We can figure out the implications of how synthetic keys are encoded, but I don't see that issue (#3) having a huge impact on use cases. So is #2 the main disagreement?
>>>
>>> We are mainly interested in upserts and deletes by natural key. We are not interested in more powerful queries of the types mentioned in the doc.
>>>
>>> On the other hand, in our cases, the upserts and deletes are generated upstream. In other words, I have an incremental transformation job J1 with inputs A and B, producing an output C. I can generate the upserts and deletes directly from my inputs, without referring to the current state of C.
>>>
>>> C might not be partitioned or sorted by the natural key, either, which means that finding the synthetic key can be expensive.
>>>
>>> In our bespoke model, we are able to generate a new delta (and append it to the dataset) without referring to the existing base or existing deltas.
>>>
>>> Another apparent divergence is that my jobs are incremental and want to be able to track the state they have seen so far in order to consume only new deltas in their next execution. So if job J2 consumes C, it needs to know that the deltas since its last read will not have been compacted into the base. This implies support for multiple generations of deltas, as different consumers could be at different points in the stream of deltas.
>>>
>>>>>> On Wed, May 15, 2019 at 3:55 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>> 2. Iceberg diff files should use synthetic keys
>>>>>>
>>>>>> A lot of the discussion on the doc is about whether natural keys are practical or what assumptions we can make or trade about them.
>>>>>> In my opinion, Iceberg tables will absolutely need natural keys for reasonable use cases. And those natural keys will need to be unique. And Iceberg will need to rely on engines to enforce that uniqueness.
>>>>>>
>>>>>> But, there is a difference between table behavior and implementation. We can use synthetic keys to implement the requirements of natural keys. Each row should be identified by its file and position in a file. When deleting by a natural key, we just need to find out what the synthetic key is and encode that in the delete diff.
>>>>>>
>>>>> This comment has important implications for the effort required to generate delete diff files. I've tried to cover why in comments I added today to the doc, but it could also be a topic of the hangout.
>>>>
>>>> Do you mean that you can't encode a delete without reading data to locate the affected rows?
>>>
>>> Yes.
>>>
>>>>>> 3. Synthetic keys should be based on filename and position
>>>>>>
>>>>>> I think identifying the file in a synthetic key makes a lot of sense. This would allow for delta file reuse as individual files are rewritten by a "major" compaction and provides nice flexibility that fits with the format. We will need to think through all the impacts, like how file relocation works (e.g., move between regions) and the requirements for rewrites (must apply the delta when rewriting).
>>>>>>
>>>>> I'm confused. I feel like specifying the filename has the opposite effect. One of the biggest advantages of Iceberg is the decoupling of a dataset from the physical location of the constituent files. If a delta file encodes the filename of the row that it updates/deletes, you are putting a significant constraint on the way that an implementation can manipulate those files later.
>>>>
>>>> If I understand your concern, it is that we are encoding a file location in the delete diff. We could solve this with a level of indirection like an ID for data files in table metadata. So there are options to make sure we can still move files and data around.
>>>>
>>>> What I like about using a filename or file-specific identifier is that the deltas are tied to a particular file. When that file is deleted, the delta no longer needs to be carried forward. So if I have a maintenance job that is compacting small files, it must apply deletes when rewriting all of those files. But we don't have to rewrite or replace the delete diff file because its deletes were scoped to a file that no longer exists.
>>>
>>> My intuition is that you overestimate the complexity of maintaining a sequence of deltas and underestimate the constraints that would be imposed by the "level of indirection like an ID for data files in table metadata". Prior to this proposal, there were many legitimate and useful things one could do to the files making up a dataset between two snapshots. For example, one could split a file into multiple files using a new partition scheme. How does the level of indirection work when what was previously one file becomes five?
>>>
>>>> This is a good way around an ugly problem of knowing when to apply a particular delete diff. Say we are using a UUID column for a natural key. Then deletes just need to encode a set of UUIDs. But when a row is upserted, it gets deleted and then re-appended with the same (natural) UUID key.
>>>> So we have to scope the delete to just the file that contained the original row and not the inserted data file. It is tempting to use snapshot ordering for this (delete diff applies to versions < V), but we eventually lose the order of snapshots when they expire. We could also use insert diff files, but then replacing the same row more than once has the same problem (how does a delete apply to an insert diff?). Then to correctly encode state, we would have to either solve the order problem or require a minor compaction.
>>>
>>> > lose the order of snapshots when they expire
>>>
>>> Generations of data can be considered distinct from snapshots. Within a given snapshot, the files would be grouped by generation. The rules for expiring snapshots would be orthogonal to the rules for compacting generations. And the rules for GC'ing individual files would be unaffected.
>>>
>>>> Scoping the delete diffs using a file solves this problem by tying the lifecycle of deltas to the lifecycle of files. If a file is live in the table, then all deletes for that file must be applied when reading or compacting it. When compacting, you can either write a delete diff for the new file or remove the deleted records. That requirement seems pretty reasonable to me.
>>>
>>> Iceberg has always had to consider which files need to be accessed in order to find all data relevant to a query. Applying deltas to a base (whether for merge-on-read or for compaction) is no different. One first translates filter predicates to partition predicates, picks up all partitions that might contain relevant data, applies remaining filter predicates, and then collapses base data with subsequent deltas on the fly.
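To Erik's last point about collapsing base data with deltas on the fly: here's a minimal sketch, again in plain Java with invented names (not Iceberg APIs), of how a reader could apply file-scoped position deletes while scanning. It prunes data files by partition predicate and then drops the deleted positions of each surviving file, which is also why a delete diff scoped to a file that has been compacted away can simply be dropped:

import java.util.*;
import java.util.stream.*;

public class MergeOnReadSketch {
    // Illustrative only: a data file with its partition value and rows.
    record DataFile(String path, String partition, List<String> rows) {}
    // A delete diff scoped to one data file: positions of deleted rows.
    record DeleteDiff(String dataFilePath, Set<Integer> deletedPositions) {}

    static Stream<String> scan(List<DataFile> files,
                               List<DeleteDiff> diffs,
                               String partitionPredicate) {
        // Index delete diffs by the data file they are scoped to.
        Map<String, Set<Integer>> deletesByFile = diffs.stream()
                .collect(Collectors.toMap(DeleteDiff::dataFilePath,
                                          DeleteDiff::deletedPositions));
        return files.stream()
                // 1. Prune files using the partition predicate.
                .filter(f -> f.partition().equals(partitionPredicate))
                // 2. Collapse each base file with its delete diff on the fly.
                .flatMap(f -> {
                    Set<Integer> deleted =
                            deletesByFile.getOrDefault(f.path(), Set.of());
                    return IntStream.range(0, f.rows().size())
                            .filter(pos -> !deleted.contains(pos))
                            .mapToObj(f.rows()::get);
                });
    }

    public static void main(String[] args) {
        List<DataFile> files = List.of(
                new DataFile("data-0.parquet", "2019-05-21", List.of("a", "b", "c")),
                new DataFile("data-1.parquet", "2019-05-22", List.of("d", "e")));
        // Deletes position 1 of data-0.parquet; if data-0.parquet is later
        // compacted away, this diff can simply be dropped with it.
        List<DeleteDiff> diffs = List.of(new DeleteDiff("data-0.parquet", Set.of(1)));
        scan(files, diffs, "2019-05-21").forEach(System.out::println); // prints a, c
    }
}

Whether the delete is keyed by filename or by a file ID in table metadata (the indirection Ryan mentions above), the read path would look roughly the same; only the lookup of which diffs apply to a given file changes.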