Sounds like we’re in agreement on the direction! Let’s have a sync-up sometime next week to make sure we’re aligned and to plan some of this work. What days work best for everyone?
Here’s a quick summary of the approach from my perspective:

- Will define two options for writing delete files that identify rows by:
  1. File and offset, for MERGE INTO or similar operations, and
  2. Column equality, for encoding deletes without reading data
- Will add a delete file type to manifests
- Will add sequence numbers to existing metadata to track when to apply delete files to data files (in time), used by both delete file options
- Will add a tracking sequence number to manifest file metadata in the manifest list, as well as a min sequence number for each manifest (min data seq as well?)
- For column equality deletes, the partition of the file in the manifest will track the partition that it applies to (stats filtering to be done at read time)

I’d also like to have a way to encode deletes across the entire dataset, for GDPR deletes (e.g. delete a user from all sessions partitioned by time). Any ideas about how to encode this?

More replies to specific points:

> If there is a monotonically increasing value per snapshot, should we simply update the requirement for the snapshot ID to require it to be monotonically increasing?

I was thinking about this and I don’t think it’s a good idea, but I could be convinced otherwise. When compacting data and delete files, the sequence number of the compacted result is set to the sequence number of the latest delete file applied to the data, but the compacted file is actually added in the replace snapshot. So we wouldn’t be able to identify the files that were added by a compaction without running a diff against another snapshot. I don’t think we want to lose that information, because we want to be able to recover what operation created any given file, as long as the snapshot and its metadata are still around.

> The sequence number assigned to the resulting files is irrelevant if *all* relevant deletes have been applied.

I agree, but we can also use the sequence number to avoid commit conflicts. If a delete is committed just before a compaction, we don’t want to re-run the compaction. Using the highest sequence number from the delete files applied to the data, we can confidently commit the compaction, knowing that any newer deltas will still be applied.

> Ultimately, all files would appear to be eligible for compaction in any rewrite snapshot.

Yes, which I think is a benefit, right?

> I don’t think that the other approach inhibits aging off data, it just represents the deletion of the data differently.

True, but if we can preserve the existing functionality, then this is a quick operation. Usually, these file-level deletes align with partitioning, so they are quick to run. And we can prune any delete files that no longer have data files after it happens.
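To make the sequence-number mechanics above a bit more concrete, here is a rough sketch in Java. The types and names are made up for illustration (not proposed APIs), and whether the "equal sequence number" case applies is a detail we would still need to pin down in the spec:

    import java.util.List;

    // Rough sketch with made-up names (not proposed APIs): how a reader could use
    // sequence numbers to decide whether a delete file applies to a data file, and
    // how a compaction could record the max applied delete sequence number.
    class SequenceNumberSketch {

      static class TrackedFile {
        final String path;
        final long sequenceNumber;  // sequence number of the snapshot that added the file

        TrackedFile(String path, long sequenceNumber) {
          this.path = path;
          this.sequenceNumber = sequenceNumber;
        }
      }

      // A delete file applies to data files written no later than the delete.
      // (Whether the "equal" case applies is a detail to settle in the spec.)
      static boolean applies(TrackedFile deleteFile, TrackedFile dataFile) {
        return dataFile.sequenceNumber <= deleteFile.sequenceNumber;
      }

      // A compacted file keeps the highest sequence number among the deletes that
      // were applied: deletes committed after that number still apply to the new
      // file, and a delete committed at or below it signals a conflict to re-check.
      static long compactedSequenceNumber(List<TrackedFile> appliedDeletes) {
        return appliedDeletes.stream()
            .mapToLong(f -> f.sequenceNumber)
            .max()
            .orElse(0L);
      }

      public static void main(String[] args) {
        TrackedFile dataFile = new TrackedFile("data-1.parquet", 3);
        TrackedFile deleteFile = new TrackedFile("delete-1.avro", 5);
        System.out.println(applies(deleteFile, dataFile));                 // true
        System.out.println(compactedSequenceNumber(List.of(deleteFile)));  // 5
      }
    }
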
On Thu, Jun 20, 2019 at 8:06 AM Erik Wright <erik.wri...@shopify.com> wrote:
>
> On Wed, Jun 19, 2019 at 6:39 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Replies inline.
>>
>> On Mon, Jun 17, 2019 at 10:59 AM Erik Wright <erik.wri...@shopify.com.invalid> wrote:
>>
>>>> Because snapshots and versions are basically the same idea, we don’t need both. If we were to add versions, they should replace snapshots.
>>>
>>> I'm a little confused by this. You mention sequence numbers quite a bit, but then say that we should not introduce versions. From my understanding, sequence numbers and versions are essentially identical. What has changed is where they are encoded (in file metadata vs. in snapshot metadata). It would also seem necessary to keep them distinct from snapshots in order to be able to collapse/compact files older than N (oldest valid version) without losing the ability to incrementally apply the changes from N + 1.
>>
>> Snapshots are versions of the table, but Iceberg does not use a monotonically increasing identifier to track them. Your proposal added table versions in addition to snapshots to get a monotonically increasing ID, but my point is that we don’t need to duplicate the “table state at some time” idea by adding a layer of versions inside snapshots. We just need a monotonically increasing ID associated with each snapshot. That ID is what I’m calling a sequence number. Each snapshot will have a new sequence number used to track the order of files for applying changes that are introduced by that snapshot.
>
> If there is a monotonically increasing value per snapshot, should we simply update the requirement for the snapshot ID to require it to be monotonically increasing?
>
>> With sequence numbers embedded in metadata, you can still compact. If you have a file with sequence number N, then updated by sequence numbers N+2 and N+3, you can compact the changes from N and N+2 by writing the compacted file with sequence N+2. Rewriting files in N+2 is still possible, just like with the scheme you suggested, where the version N+2 would be rewritten.
>>
>> A key difference is that because the approach using sequence numbers doesn’t require keeping state around in the root metadata file, tables are not required to rewrite or compact data files just to keep the table functioning. If the data won’t be read, then why bother changing it?
>>
>>> Because I am having a really hard time seeing how things would work without these sequence numbers (versions), and given their prominence in your reply, my remaining comments assume they are present, distinct from snapshots, and monotonically increasing.
>>
>> Yes, a sequence number would be added to the existing metadata. Each snapshot would produce a new sequence number.
>>
>>>> Another way of thinking about it: if versions are stored as data/delete file metadata and not in the table metadata file, then the complete history can be kept.
>>>
>>> Sure. Although version expiry would still be an important process to improve read performance and reduce storage overhead.
>>
>> Maybe. If it is unlikely that the data will be read, the best option may be to leave the original data file and associated delete file.
>
> It's safe to say that it is optional for any given use case, but that it is a required capability.
>
>>> Based on your proposal, it would seem that we can basically collapse any given data file with its corresponding deletions. It's important to consider how this will affect incremental readers. The following make sense to me:
>>>
>>> 1. We should track an oldest-valid-version in the table metadata. No changes newer than this version should be collapsed, as a reader who has consumed up to that version must be able to continue reading.
>>
>> Table versions are tracked by snapshots, which are kept for a window of time. The oldest valid version is the oldest snapshot in the table. New snapshots can always compact or rewrite older data, and incremental consumers ignore those changes by skipping snapshots where the operation was replace, because the data has not changed.
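A rough sketch of the incremental-consumption behavior described just above; the Snapshot type and its fields here are made up for illustration, not the real API:

    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch with made-up types (not Iceberg's API): an incremental consumer
    // processes only append/delete/overwrite snapshots and skips replace snapshots,
    // because a rewrite changes no data. With this proposal it would also pick up
    // the delete files added by each snapshot.
    class IncrementalConsumerSketch {

      static class Snapshot {
        final long snapshotId;
        final String operation;               // "append", "delete", "overwrite", or "replace"
        final List<String> addedDataFiles;
        final List<String> addedDeleteFiles;  // new with this proposal

        Snapshot(long snapshotId, String operation,
                 List<String> addedDataFiles, List<String> addedDeleteFiles) {
          this.snapshotId = snapshotId;
          this.operation = operation;
          this.addedDataFiles = addedDataFiles;
          this.addedDeleteFiles = addedDeleteFiles;
        }
      }

      // `unread` holds the snapshots committed after the last one this consumer
      // processed, in commit order.
      static List<String> newlyAddedFiles(List<Snapshot> unread) {
        List<String> changes = new ArrayList<>();
        for (Snapshot s : unread) {
          if ("replace".equals(s.operation)) {
            continue;  // compaction/rewrite: no data change, safe to skip
          }
          changes.addAll(s.addedDataFiles);
          changes.addAll(s.addedDeleteFiles);
        }
        return changes;
      }

      public static void main(String[] args) {
        List<Snapshot> unread = List.of(
            new Snapshot(11L, "append", List.of("data-2.parquet"), List.of()),
            new Snapshot(12L, "replace", List.of("compacted-1.parquet"), List.of()),
            new Snapshot(13L, "delete", List.of(), List.of("delete-2.avro")));
        System.out.println(newlyAddedFiles(unread));
        // prints [data-2.parquet, delete-2.avro]; the replace snapshot is skipped
      }
    }
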
>> Incremental consumers operate on append, delete, and overwrite snapshots and consume data files that are added or deleted. With delete files, these would also consume delete files that were added in a snapshot.
>
> Sure, when a whole file is deleted there is already a column for indicating that, so we do not need to represent it a second time.
>
>> It is unlikely that sequence numbers would be used by incremental consumers, because the changes are already available for each snapshot. This distinction is the main reason why I’m calling these sequence numbers and not “versions”. The sequence number for a file indicates which delete files need to be applied; files can be rewritten, and the new copy gets a different sequence number, as you describe just below.
>
> Interesting. This explains the difference in the way we are looking at this. Yes, I agree that this ought to work and seems more consistent with the way that Iceberg works currently.
>
>>> 2. Any data file strictly older than the oldest-valid-version is eligible for delete collapsing. During delete collapsing, all deletions up to version N (any version <= the oldest-valid-version) are applied to the file. *The newly created file should be assigned a sequence number of N.*
>>
>> First, a data file is kept until the last snapshot that references it is expired. That enables incremental consumers without a restriction on rewriting the file in a later replace snapshot.
>>
>> Second, I agree with your description of rewriting the file. That’s what I was trying to say above.
>
> Given that incremental consumers will skip "replace" snapshots and will not attempt to incrementally read the changes introduced in version "N" using the manifest list file for version "N+1", there would appear to be no restrictions on how much data can be collapsed in any rewrite operation. The sequence number assigned to the resulting files is irrelevant if _all_ relevant deletes have been applied.
>
>>> 3. Any delete file that has been completely applied is eligible for expiry. A delete file is completely applied if its sequence number is <= all data files to which it may apply (based on partitioning data).
>>
>> Yes, I agree.
>
>>> 4. Any data file equal to or older than the oldest-valid-version is eligible for compaction. During compaction, some or all data files up to version N (any version <= the oldest-valid-version) are selected. There must not be any deletions applicable to any of the selected data files. The selected data files are read, rewritten as desired (partitioning, compaction, sorting, etc.). The new files are included in the new snapshot (*with sequence number N*) while the old files are dropped.
>>
>> Yes, I agree.
>
> Ultimately, all files would appear to be eligible for compaction in any rewrite snapshot.
>
>>> 5. Delete collapsing, delete expiry, and compaction may be performed in a single commit. The oldest-valid-version may be advanced during this process as well. The outcome must logically be consistent with applying these steps independently.
>>
>> Because snapshots (table versions) are independent, they can always be expired. Incremental consumers must process a new snapshot before it is old enough to expire. The time period after which snapshots are expired is up to the process running ExpireSnapshots.
>
> Agreed.
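A rough sketch of the delete-expiry rule from point 3 above, again with made-up types rather than real APIs: a delete file can be dropped once every live data file it could apply to is at least as new as the delete, so no remaining data file still needs it at read time.

    import java.util.List;

    // Rough sketch with made-up types: a delete file is fully applied (and can be
    // expired) when no live data file in its partition has an older sequence number.
    class DeleteExpirySketch {

      static class FileEntry {
        final long sequenceNumber;
        final String partition;  // stand-in for the real partition tuple

        FileEntry(long sequenceNumber, String partition) {
          this.sequenceNumber = sequenceNumber;
          this.partition = partition;
        }
      }

      static boolean fullyApplied(FileEntry deleteFile, List<FileEntry> liveDataFiles) {
        return liveDataFiles.stream()
            .filter(data -> data.partition.equals(deleteFile.partition))
            .allMatch(data -> data.sequenceNumber >= deleteFile.sequenceNumber);
      }

      public static void main(String[] args) {
        FileEntry deleteFile = new FileEntry(4, "day=2019-06-01");
        List<FileEntry> live = List.of(
            new FileEntry(5, "day=2019-06-01"),   // rewritten after the delete
            new FileEntry(2, "day=2019-06-02"));  // different partition, not applicable
        System.out.println(fullyApplied(deleteFile, live));  // true: safe to drop the delete
      }
    }
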
>>> Important to note in the above is that, during collapsing/compaction, new files may be emitted with a version number N that are older than the most recent version. This is important to ensure that all deltas newer than N are appropriately applied, and that incremental readers are able to continue processing the dataset.
>>
>> Because incremental consumers operate using snapshots and not sequence numbers, I think this is decoupled in the approach I’m proposing.
>
> Agreed.
>
>>>> It is also compatible with file-level deletes used to age off old data (e.g. delete where day(ts) < '2019-01-01').
>>>
>>> This particular operation is not compatible with incremental consumption.
>>
>> I disagree. This just encodes deletes for an entire file’s worth of rows in the manifest file instead of in a delete file. Current metadata tracks when files are deleted, and we would also associate a sequence number with those changes if we needed to. Incremental consumers will work like they do today: they will consume snapshots that contain changes (append/delete/overwrite) and will get the files that were changed in that snapshot.
>
> Agreed.
>
>> Also, being able to cleanly age off data is a hard requirement for Iceberg. We all need to comply with data policies with age limits, and we need to ensure that we can cleanly apply those changes.
>
> I don't think that the other approach inhibits aging off data, it just represents the deletion of the data differently. In any case, I agree that we can reuse the existing "deleted in snapshot" mechanism.
>
>>> One must still generate a deletion that is associated with a new sequence number. I could see a reasonable workaround where, in order to delete an entire file (insertion file `x.dat`, sequence number X), one would reference the same file a second time (deletion file `x.dat`, sequence number Y > X). Since an insertion file and a deletion file have compatible formats, there is no need to rewrite this file simply to mark each row in it as deleted. A savvy consumer would be able to tell that this is a whole-file deletion, while a naïve consumer would be able to apply it using the basic algorithm.
>>
>> With file-level deletes, the files are no longer processed by readers, unless an incremental reader needs to read all of the deletes from the file.
>
>>> If it seems like we are roughly on the same page I will take a stab at updating that document to go to the same level of detail that it does now, but using the sequence-number approach.
>>
>> Sure, that sounds good.
>
> Phew. I think we are nearly there!
>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix