Hi Ryan,

Thanks for providing more clarification on the current workings of the metadata system in Iceberg.
I must say that, having read the Table Spec document, it's still quite difficult to understand in detail how everything works conceptually.

Based on your description below, and particularly the "independence" property of snapshots that you describe, it feels to me that the abstract data structure this is attempting to model is some kind of persistent tree (persistent in the sense of functional, immutable: https://en.wikipedia.org/wiki/Persistent_data_structure#Trees). This kind of data structure seems to align well with the requirements of not supporting in-place updates (since we use write-once storage for metadata) and of not changing unaffected parts (maximizing data sharing).

I wonder if it would be possible to formalize the metadata system as such an abstract data structure. I think this would go a long way towards making sure that we understand the kinds of operations that can be applied to it and the complexity bounds of those operations. Or, at the very least, it would be extremely helpful for everyone looking to understand Iceberg to have a diagram of the current metadata design available: a tree with the nodes and edges annotated with the essential information they contain, together with a presentation (perhaps also visual) of the main operations that can currently be applied to this data structure. I think this would help reduce the confusion around numbering systems, snapshot vs. version, etc., where I admit I am myself a bit lost. It would also help in gaining confidence that the operations are all consistent and that there are no edge cases that could leave the metadata in an inconsistent state. (To check my own understanding, I have interleaved a few small sketches, marked [Cristian], in the quoted text below.)

Also, from what I can gather from a few email exchanges, is the main reason for the current design to allow uncoordinated concurrent writers to consistently modify the metadata structures with the lowest possible write latency? It would be interesting if you could provide more detail on this particular use case, to better understand the rationale for the current design.

Thanks,
Cristian

> On 14 Jun 2019, at 23:29, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> First, I want to clear up some of the confusion with a bit of background on how Iceberg works, but I have inline replies below as well.
>
> Conceptually, a snapshot is a table version. Each snapshot is the complete table state at some point in time and each snapshot is independent. Independence is a key property of snapshots because it allows us to maintain table data and metadata.
>
> Snapshots track additions and deletions using per-file status (ADDED/EXISTING/DELETED) and the snapshot ID when each file was added or deleted. This can be efficiently loaded and used to process changes incrementally. Snapshots are also annotated with the operation that created each snapshot. This is used to identify snapshots with no logical changes (replace) and snapshots with no deletions (append) to speed up incremental consumption and operations like snapshot expiration.
>
> Readers use the current snapshot for isolation from writes. Because readers may be using an old snapshot, Iceberg keeps snapshots around and only physically deletes files when snapshots expire — data files referenced in a valid snapshot cannot be deleted. Independence is important so that snapshots can expire and get cleaned up without any other actions on the table data.
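[Cristian] To check my understanding of the structure you describe above, and of the persistent-tree framing I ask about at the top of this mail, here is a tiny sketch. The class and field names are my own invention, not Iceberg's real classes; it is only meant to show immutable metadata nodes with structural sharing between snapshots:

    import java.util.List;

    public class PersistentTreeSketch {
        // Leaf level: a data file tracked with the per-file status and snapshot id described above.
        record DataFileEntry(String path, String status, long snapshotId) {}

        // A manifest is an immutable list of data file entries.
        record Manifest(String path, List<DataFileEntry> entries) {}

        // A snapshot is an immutable node pointing at a list of manifests (its manifest list).
        record Snapshot(long id, List<Manifest> manifestList) {}

        public static void main(String[] args) {
            Manifest m1 = new Manifest("m1.avro", List.of(
                    new DataFileEntry("file_a.parquet", "ADDED", 1),
                    new DataFileEntry("file_b.parquet", "ADDED", 1)));
            Snapshot s1 = new Snapshot(1, List.of(m1));

            // A new commit writes new nodes only for the changed parts and reuses m1 unchanged,
            // which is the "maximizing data sharing" property of a persistent tree.
            Manifest m2 = new Manifest("m2.avro", List.of(
                    new DataFileEntry("file_c.parquet", "ADDED", 2)));
            Snapshot s2 = new Snapshot(2, List.of(m1, m2));

            System.out.println(s1.manifestList().get(0) == s2.manifestList().get(0)); // true: shared node
        }
    }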
> Consider the alternative: if each snapshot depended on the previous one, then Iceberg cannot expire them and physically delete data. Here’s an example snapshot history:
>
> S1
> - add file_a.parquet
> - add file_b.parquet
> S2
> - delete file_a.parquet
> S3
> - add file_c.parquet
>
> To delete file_a.parquet, S1 and S2 need to be rewritten into S2’ that contains only file_b.parquet. This would work, but takes more time than dropping references and deleting files that are no longer used. An expensive metadata rewrite to delete data isn’t ideal, and not taking the time to rewrite makes metadata stack up in the table metadata file, which gets rewritten often.
>
> Finally, Iceberg relies on regular metadata maintenance — manifest compaction — to reduce both write volume and the number of file reads needed to plan a scan. Snapshots also reuse metadata files from previous snapshots to reduce write volume.
>
> On Wed, Jun 12, 2019 at 9:12 AM Erik Wright <erik.wri...@shopify.com> wrote:
>
> > Maintaining separate manifest lists for each version requires keeping versions in separate metadata trees, so we can’t merge manifest files together across versions. We initially had this problem when we used only fast appends for our streaming writes. We ended up with way too many metadata files and job planning took a really long time.
> >
> > If I understand correctly, the main concern here is for scenarios where versions are essentially micro-batches. They accumulate rapidly, each one is relatively small, and there may be many valid versions at a time even if expiry happens regularly. The negative impact is that many small versions cannot be combined together. Even the manifest list files are many, and all of them need to be read.
>
> I’m not sure how you define micro-batch, but this happens when commits are on the order of every 10 minutes and consist of large data files. Manifest compaction is required to avoid thousands of small metadata files impacting performance.
>
> > Although snapshots are independent and can be aged off, this requires rewriting the base version to expire versions. That means we can’t delete old data files until versions are rewritten, so the problem affects versions instead of snapshots. We would still have the problem of not being able to delete old data until after a compaction.
> >
> > I don't quite understand this point, so forgive me if the following is irrelevant or obvious.
> >
> > You are correct that snapshots become less useful as a tool, compared to versions, for expiring data. The process now goes through (1) expire a version and then (2) invalidate the last snapshot containing that version. Of course, fundamentally you need to keep version X around until you are willing to compact delta Y (or more) into it. And this is (I believe) our common purpose in this discussion: permit a trade-off between write amplification and storage/read costs.
>
> Hopefully the background above is useful. Requiring expensive data-file compaction in order to compact version metadata or delete files is a big problem.
>
> And you’re right that snapshots are less useful. Because snapshots and versions are basically the same idea, we don’t need both. If we were to add versions, they should replace snapshots. But I don’t think it is a good idea to give up the independence.
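[Cristian] If I follow the example above, the benefit of independence is that expiring S1 and S2 reduces to dropping them and then deleting any file that no remaining snapshot references, with no rewriting of metadata. A toy version of that check (invented names, not the real implementation):

    import java.util.*;

    public class ExpireSketch {
        public static void main(String[] args) {
            // Each snapshot is the complete, independent table state from the example above.
            Map<String, Set<String>> snapshots = new LinkedHashMap<>();
            snapshots.put("S1", Set.of("file_a.parquet", "file_b.parquet"));
            snapshots.put("S2", Set.of("file_b.parquet"));                   // file_a logically deleted
            snapshots.put("S3", Set.of("file_b.parquet", "file_c.parquet")); // file_c added

            // Expire S1 and S2: simply drop them, then delete any file no live snapshot references.
            Set<String> candidates = new HashSet<>();
            snapshots.remove("S1").forEach(candidates::add);
            snapshots.remove("S2").forEach(candidates::add);

            Set<String> stillReferenced = new HashSet<>();
            snapshots.values().forEach(stillReferenced::addAll);
            candidates.removeAll(stillReferenced);

            // Only file_a.parquet comes out as physically deletable; no snapshot had to be rewritten.
            System.out.println("Physically deletable: " + candidates);
        }
    }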
> The good news is that we can use snapshots exactly like versions and avoid these problems by adding sequence numbers.
>
> > The value of decoupling versions from snapshots is:
> >
> > 1. It makes the process of correctly applying multiple deltas easy to reason about.
> > 2. It allows us to issue updates that don't introduce new versions (i.e., those that merely expire old versions, or that perform other optimizations).
> > 3. It enables clients to cleanly consume new deltas when those clients are sophisticated enough to incrementally update their own outputs as a function of those changes.
>
> Points 2 and 3 are available already, and 1 can be done with sequence numbers in snapshot metadata.
>
> > Here are some benefits to this approach:
> >
> > There is no explicit version that must be expired so compaction is only required to optimize reads — partition or file-level delete can coexist with deltas
> >
> > Can you expand on this?
>
> By adding a sequence number instead of versions, the essential version information is embedded in the existing metadata files. Those files are reused across snapshots so there is no need to compact deltas into data files to reduce the write volume.
>
> Another way of thinking about it: if versions are stored as data/delete file metadata and not in the table metadata file, then the complete history can be kept.
>
> It is also compatible with file-level deletes used to age off old data (e.g. delete where day(ts) < '2019-01-01'). When files match a delete, they are removed from metadata and physically deleted when the delete snapshot is expired. This is independent of the delete files that get applied on read. In fact, if a delete file’s sequence number is less than all data files in a partition, the delete itself can be removed and never compacted.
>
> > When a delete’s sequence number is lower than all data files, it can be pruned
> >
> > In combination with the previous comment, I have the impression that we might be losing the distinction between snapshots and versions. In particular, what I'm concerned about is that, from a consumer's point of view, it may be difficult to figure out, from one snapshot to the next, what new deltas I need to process to update my view.
>
> We don’t need both and should go with either snapshots (independent) or versions (dependent). I don’t think it is possible to make dependent versions work.
>
> But as I noted above, the changes that occurred in a snapshot are available as data file additions and data file deletions. Delete file additions and deletions would be tracked the same way. Consumers would have full access to the changes.
>
> > Compaction can translate a natural key delete to a synthetic key delete by adding a synthetic key delete file and updating the data file’s sequence number
> >
> > I don't understand what this means.
>
> Synthetic keys are better than natural keys in several ways. They are cheaper to apply at read time, they are scoped to a single data file, they compress well, etc. If both are supported, compaction could rewrite a natural key delete as a synthetic key delete to avoid rewriting the entire data file to delete just a few rows.
>
> In addition, if a natural key delete is applied to an entire table, it would be possible to change sequence numbers to apply this process incrementally. But this would change history, just like rewriting the base version of a table loses the changes in that version. So we would need to be careful about when changing a sequence number is allowed.
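[Cristian] To check that I'm reading the sequence-number idea correctly: a delete file would apply only to data files that are older than it (lower sequence number), and once no such data file remains in a partition, the delete can be dropped, as you describe above. A toy version of that rule (my own names, not the spec's):

    import java.util.List;

    public class SequenceNumberSketch {
        record DataFile(String path, long sequenceNumber) {}
        record DeleteFile(String path, long sequenceNumber) {}

        // A delete applies to data files committed before it, i.e. with a lower sequence number.
        static boolean applies(DeleteFile delete, DataFile dataFile) {
            return dataFile.sequenceNumber() < delete.sequenceNumber();
        }

        // "When a delete's sequence number is lower than all data files, it can be pruned":
        // nothing remains for it to apply to, so it can be dropped without any compaction.
        static boolean prunable(DeleteFile delete, List<DataFile> partition) {
            return partition.stream().noneMatch(f -> applies(delete, f));
        }

        public static void main(String[] args) {
            List<DataFile> partition = List.of(
                    new DataFile("file_b.parquet", 1),
                    new DataFile("file_c.parquet", 3));
            DeleteFile d = new DeleteFile("delete_1.avro", 2);

            System.out.println(applies(d, partition.get(0))); // true: file_b (seq 1) predates the delete
            System.out.println(applies(d, partition.get(1))); // false: file_c (seq 3) was written after it
            System.out.println(prunable(d, partition));       // false: file_b still needs the delete applied
        }
    }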
> > Synthetic and natural key approaches can coexist
> >
> > Can you elaborate on what the synthetic key approach would imply, and what use-cases it would serve?
>
> Some use cases, like MERGE INTO, already require reading data so we can write synthetic key delete files without additional work. That means Iceberg should support not just natural key deletes, but also synthetic.
>
> > I agree we are getting closer. I am confused by some things, and I suspect that the difference lies in the use-cases I wish to support for incremental consumers of these datasets. I have the impression that most of you folks are primarily considering incremental writes that end up serving analytical use cases (that want to see the full effective dataset as of the most recent snapshot).
>
> We have both use cases and I want to make sure Iceberg can be used for both.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
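[Cristian] P.S. To make sure I follow the natural vs. synthetic key distinction discussed above, here is how I currently picture the two delete encodings (an invented representation, purely for illustration):

    import java.util.List;
    import java.util.Set;

    public class DeleteEncodingSketch {
        // Natural key delete: "delete every row whose key column has one of these values".
        // It can match rows in any data file, so it must be checked against all of them on read.
        record NaturalKeyDelete(String keyColumn, Set<String> keyValues) {}

        // Synthetic key delete: "delete these row positions in this one data file".
        // It is scoped to a single file, cheap to apply, and compresses well, as described above.
        record SyntheticKeyDelete(String dataFilePath, List<Long> deletedPositions) {}

        public static void main(String[] args) {
            NaturalKeyDelete byKey = new NaturalKeyDelete("customer_id", Set.of("42", "99"));

            // What I understand by "compaction can translate a natural key delete to a synthetic
            // key delete": read file_b, find the positions where customer_id is 42 or 99, and
            // record those positions instead of rewriting the whole data file.
            SyntheticKeyDelete byPosition = new SyntheticKeyDelete("file_b.parquet", List.of(17L, 4021L));

            System.out.println(byKey);
            System.out.println(byPosition);
        }
    }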