Yep, sounds good. Thank you for putting this together; it adds a lot of clarity to the discussion and opens a clear path forward.
+1 from me (or +3 if we add Miguel and Anton) on going ahead with this approach

Sent from my iPhone

> On 7 Jun 2019, at 21:23, Erik Wright <erik.wri...@shopify.com> wrote:
>
> Hi Cristian,
>
> I'm glad to hear that this feels like alignment to you. I agree that it should closely correspond to the option that you described. My intent in writing this out was to clarify some of the details because they were a sticking point in the live discussion.
>
>> One thing though: guaranteeing uniques doesn't seem possible with large datasets. In fact, no real big data system does that (We posted some examples in the comment thread on our proposal doc).
>
> Clearly it is "possible" to guarantee uniques by comparing deltas to a base version during the commit process. But, as you alluded to, and as I mention in the document, the cost of doing so is highly dependent on the nature of the dataset and the updates to it. Furthermore, the likelihood of duplicates is highly variable, as is the cost of violations.
>
> With all of those things in mind, I have recommended making this enforcement optional (if available at all). Instead I have chosen to describe the potential impact of errors in the dataset production. I identified two possible errors and illustrated how those errors might be observed by a consumer.
>
>> The good news however, is we don't need uniqueness for all of this to work. It works just the same even if there are duplicates, though with a bit more care in implementation to handle some corner cases consistently.
>
> Agreed that, under certain conditions, deterministic behaviour can be achieved in the face of unintended duplicates in input. And this would be my recommended approach.
>
> Thanks for taking the time to read my document.
>
> Erik
>
>> On Fri, Jun 7, 2019 at 3:27 PM Cristian Opris <cop...@apple.com> wrote:
>> Hey Erik,
>>
>> I was able to parse your proposal quite quickly since it seems to be pretty much exactly the "lazy with natural key" option in our proposed approach.
>>
>> I must say it is, however, much more concretely expressed.
>>
>> This natural key approach is in fact our preferred option too, even though it wasn't explicitly said in the document. I guess we wanted to get more feedback just in case we were missing something. Turns out we weren't :)
>>
>> In particular, I fully agree with this:
>>
>> "The effective dataset of any other version is the equivalent of the previous version, anti-join the deletes in the current version (on the unique row identifier), union the insertions in the current version. Optimal performance will be achieved when consistent partition specifications and sorting are used from one version to the next (minimizing the number of deletes that need to be compared to each insertion)."
>>
>> Yes, nothing more than sorted natural keys is needed to achieve a good implementation that covers all use cases.
>>
>> One thing though: guaranteeing uniques doesn't seem possible with large datasets. In fact, no real big data system does that (We posted some examples in the comment thread on our proposal doc).
>>
>> It's not clear from your doc how enforcing uniqueness would work?
>>
>> The good news however, is we don't need uniqueness for all of this to work. It works just the same even if there are duplicates, though with a bit more care in implementation to handle some corner cases consistently.
>>
>> Let me know if I got the gist of your proposal right?
>>
>> Thanks,
>> Cristian
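For concreteness, the read-path rule quoted above (previous version, anti-join the deletes on the row identifier, union the insertions) can be sketched with Spark DataFrames. This is only a sketch under assumed names: prevVersion, deleteDiff, insertDiff and the natural-key column nk are illustrative, not anything defined by either proposal.

```scala
import org.apache.spark.sql.DataFrame

// Sketch of the merge-on-read read path described above, assuming:
//   prevVersion - the effective dataset of version N-1
//   deleteDiff  - the natural keys deleted in version N
//   insertDiff  - the rows inserted in version N
// "nk" is an assumed natural-key column name.
def effectiveDataset(prevVersion: DataFrame,
                     deleteDiff: DataFrame,
                     insertDiff: DataFrame): DataFrame =
  prevVersion
    .join(deleteDiff, Seq("nk"), "left_anti") // drop every row whose key appears in the delete diff
    .unionByName(insertDiff)                  // then add this version's insertions
```

Note that the anti join drops every row matching a deleted key, so unintended duplicates fall out of the same code path as unique keys, which is the behaviour both sides describe.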
>>
>>> On 7 Jun 2019, at 19:47, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>
>>> Thanks, Erik! Great to see progress here. I'll set aside some time to look this over in detail.
>>>
>>> On Fri, Jun 7, 2019 at 11:46 AM Erik Wright <erik.wri...@shopify.com> wrote:
>>>> I apologize for the delay, but I have finally put together a document describing an alternative approach to supporting updates in Iceberg while minimizing write amplification.
>>>>
>>>> Proposal: Iceberg Merge-on-Read
>>>>
>>>> Thank you, Anton and Miguel, for starting this conversation, and everyone else as well for the ongoing dialogue. I'm looking forward to continuing to discuss this and hopefully finding an approach that can meet our different needs and that we can work together on.
>>>>
>>>> Cheers,
>>>>
>>>> Erik
>>>>
>>>>> On Wed, May 29, 2019 at 7:26 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>> Yeah, I totally forgot to record our discussion. Will do so next time, sorry.
>>>>> -- Jacques Nadeau, CTO and Co-Founder, Dremio
>>>>>
>>>>>> On Wed, May 29, 2019 at 4:24 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>> It wasn't recorded, but I can summarize what we talked about. Sorry I haven't sent this out earlier.
>>>>>>
>>>>>> We talked about the options and some of the background in Iceberg -- basically that it isn't possible to determine the order of commits before you commit, so you can't rely on some monotonically increasing value from a snapshot to know which deltas to apply to a file. The result is that we can't apply diffs to data files using a rule like "files older than X" because we can't identify those files without the snapshot history.
>>>>>>
>>>>>> That gives us basically two options for scoping delete diffs: either identify the files to apply a diff to when writing the diff, or log changes applied to a snapshot and keep the snapshot history around (which is how we know the order of snapshots). The first option is not good if you want to write without reading data to determine where the deleted records are. The second prevents cleaning up snapshot history.
>>>>>>
>>>>>> We also talked about whether we should encode IDs in data files. Jacques pointed out that retrying a commit is easier if you don't need to re-read the original data to reconcile changes. For example, if a data file was compacted in a concurrent write, how do we reconcile a delete for it? We discussed other options, like rolling back the compaction for delete events. I think that's a promising option.
>>>>>>
>>>>>> For action items, Jacques was going to think about whether we need to encode IDs in data files or if we could use positions to identify rows, and write up a summary/proposal. Erik was going to take on planning how identifying rows without reading data would work and similarly write up a summary/proposal.
>>>>>>
>>>>>> That's from memory, so if I've missed anything, I hope that other attendees will fill in the details!
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>> On Wed, May 29, 2019 at 3:34 PM Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> I couldn't attend the meeting. Just curious if this was recorded by any chance.
>>>>>>>
>>>>>>> Regards
>>>>>>> Venkata krishnan
>>>>>>>
>>>>>>>> On Fri, May 24, 2019 at 8:49 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>> Yes, I agree. I'll talk a little about a couple of the constraints of this as well.
>>>>>>>>
>>>>>>>>> On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>>>> The agenda looks good to me. I think it would also make sense to clarify the responsibilities of query engines and Iceberg -- not only in terms of uniqueness, but also in terms of applying diffs on read, for example.
>>>>>>>>>
>>>>>>>>>> On 23 May 2019, at 01:59, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>>>>>>>
>>>>>>>>>> Here's a rough agenda:
>>>>>>>>>>
>>>>>>>>>> Use cases: everyone come with a use case that you'd like to have supported. We'll go around and introduce ourselves and our use cases.
>>>>>>>>>> Main topic: How should Iceberg identify rows that are deleted?
>>>>>>>>>> Side topics from my initial email, if we have time: should we use insert diffs, should we support dense and sparse formats, etc.
>>>>>>>>>>
>>>>>>>>>> The main topic I think we should discuss is: How should Iceberg identify rows that are deleted?
>>>>>>>>>>
>>>>>>>>>> I'm phrasing it this way to avoid the places where I think we're talking past one another because we are making different assumptions. The important thing is that there are two main options:
>>>>>>>>>>
>>>>>>>>>> 1. Filename and position, vs.
>>>>>>>>>> 2. Specific values of (a few) columns in the data
>>>>>>>>>>
>>>>>>>>>> This phrasing also avoids discussing uniqueness constraints. Once we get down to behavior, I think we agree. For example, I think we all agree that uniqueness cannot be enforced in Iceberg.
>>>>>>>>>>
>>>>>>>>>> If uniqueness can't be enforced in Iceberg, the main choice comes down to how we identify rows that are deleted. If we use (filename, position) then we know that there is only one row. On the other hand, if we use data values to identify rows, then a delete may identify more than one row because there are no uniqueness guarantees. I think we also agree that if more than one row is identified, all of them should be deleted.
>>>>>>>>>>
>>>>>>>>>> At that point, there are trade-offs between the approaches:
>>>>>>>>>>
>>>>>>>>>> - When identifying deleted rows by data values, situations like the one that Anton pointed out are possible.
>>>>>>>>>> - Jacques also had a good point about concurrency. If at all possible, we want to be able to reconcile changes between concurrent commits without re-running an operation.
>>>>>>>>>>
>>>>>>>>>> Sound like a reasonable amount to talk through?
>>>>>>>>>>
>>>>>>>>>> rb
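As a rough illustration of the two options Ryan lists above, a delete record could take either of the following shapes. The type and field names are hypothetical, purely for illustration, and are not Iceberg's delete format.

```scala
// Illustrative shapes only; not Iceberg's actual delete-file format.
sealed trait DeleteRecord

// Option 1: identify the deleted row by where it physically lives.
// At most one row can match, but the writer must read data to find the position.
final case class PositionDelete(dataFile: String, position: Long) extends DeleteRecord

// Option 2: identify the deleted row by the values of a few (natural key) columns.
// The writer never reads data, but without a uniqueness guarantee several rows may
// match, and per the discussion all of them would be deleted.
final case class EqualityDelete(keyValues: Map[String, Any]) extends DeleteRecord
```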
>>>>>>>>>>
>>>>>>>>>>> On Wed, May 22, 2019 at 1:17 PM Erik Wright <erik.wri...@shopify.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 22, 2019 at 4:04 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>>>>>>>>>> Agreed with Erik here, we're certainly not looking to build the equivalent of a relational database, and for that matter not even that of a local disk storage analytics database (like Vertica). Those are very different designs with very different trade-offs and optimizations.
>>>>>>>>>>>>
>>>>>>>>>>>> We're looking to automate and optimize specific types of file manipulation for large files on remote storage, while presenting that to the user under the common SQL API for bulk data manipulation (MERGE INTO).
>>>>>>>>>>>
>>>>>>>>>>> What I would encourage is to decouple the storage model from the implementation of that API. If Iceberg has support for merge-on-read of upserts and deletes, in addition to its powerful support for partitioning, it will be easy for a higher-level application to implement those APIs given certain other constraints (that might not be appropriate to all applications).
>>>>>>>>>>>
>>>>>>>>>>>> Myself and Miguel are out on Friday, but Anton should be able to handle the discussion on our side.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Cristian
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 22 May 2019, at 17:51, Erik Wright <erik.wri...@shopify.com.INVALID> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have two rows with the same natural key and we use that natural key in diff files:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   nk | col1 | col2
>>>>>>>>>>>>>>    1 |    1 |    1
>>>>>>>>>>>>>>    1 |    2 |    2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then we have a delete statement:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> DELETE FROM t WHERE col1 = 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this example cuts to the point of the differences of understanding. Does Iceberg want to be approaching the utility of a relational database, against which I can execute complex update queries? This is not what I would have imagined.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would have, instead, imagined that it was up to the client to identify, through whatever means, that they want to update or delete a row with a given ID. If there are multiple (distinct) rows with the same ID, _too bad_. Any user should _expect_ that they could potentially see any one or more of those rows at read time, and that an upsert/delete would affect any/all of them (I would argue for all).
>>>>>>>>>>>>>
>>>>>>>>>>>>> In summary: instead of trying to come up with a consistent, logical handling for complex queries that are best suited for a relational database, leave such handling up to the client and concentrate on problems that can be solved simply and more generally.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 22, 2019 at 12:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>> Yes, I think we should. I was going to propose one after catching up on the rest of this thread today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>>>>>>>>>> Thanks! Would it make sense to discuss the agenda in advance?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I sent out an invite and included everyone on this thread. If anyone else would like to join, please join the Zoom meeting. If you'd like to be added to the calendar invite, just let me know and I'll add you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>>>>>>>> works for me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To make things easier, we can use my zoom meeting if people like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Join Zoom Meeting: https://zoom.us/j/4157302092
>>>>>>>>>>>>>>>>> One tap mobile: +16465588656,,4157302092# US (New York), +16699006833,,4157302092# US (San Jose)
>>>>>>>>>>>>>>>>> Dial by your location: +1 646 558 8656 US (New York), +1 669 900 6833 US (San Jose), 877 853 5257 US Toll-free, 888 475 4499 US Toll-free
>>>>>>>>>>>>>>>>> Meeting ID: 415 730 2092
>>>>>>>>>>>>>>>>> Find your local number: https://zoom.us/u/aH9XYBfm
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -- Jacques Nadeau, CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>> 9 AM on Friday works best for me. How about then?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>>>>>>>>>>>>>> What about this Friday? A one-hour slot from 9:00 to 10:00 am or 10:00 to 11:00 am PST? Some folks are based in London, so meeting later than this is hard. If Friday doesn't work, we can consider Tuesday or Wednesday next week.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I agree with Anton that we should probably spend some time on hangouts further discussing things. Definitely differing expectations here, and we seem to be talking a bit past each other.
>>>>>>>>>>>>>>>>>>>> -- Jacques Nadeau, CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>> I love a good flame war :P
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> That's my point, truly independent writers (two Spark jobs, or a Spark job and a Dremio job) means a distributed transaction. It would need yet another external transaction coordinator on top of both Spark and Dremio; Iceberg by itself cannot solve this.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm not ready to accept this. Iceberg already supports a set of semantics around multiple writers committing simultaneously and how conflict resolution is done. The same can be done here.
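The multiple-writer semantics Jacques refers to are optimistic: a writer prepares its change against the current table state and atomically swaps in new table metadata, retrying if someone else committed first. A minimal sketch of that commit loop, with entirely hypothetical names (Catalog, TableMetadata, revalidate) rather than Iceberg's real API:

```scala
// Hypothetical sketch of an optimistic commit loop; not Iceberg's actual API.
final case class TableMetadata(version: Long, snapshots: List[String])

trait Catalog {
  def currentMetadata(table: String): TableMetadata
  // Atomically replace metadata only if `expected` is still current (compare-and-swap).
  def swap(table: String, expected: TableMetadata, updated: TableMetadata): Boolean
}

def commit(catalog: Catalog,
           table: String,
           newSnapshot: String,
           revalidate: TableMetadata => Boolean,
           maxAttempts: Int = 5): Boolean = {
  var attempts = 0
  while (attempts < maxAttempts) {
    val base = catalog.currentMetadata(table)
    // Re-check that this change still applies to whatever committed concurrently,
    // e.g. that a file targeted by a delete diff was not compacted away.
    if (!revalidate(base)) return false
    val updated = base.copy(version = base.version + 1, snapshots = base.snapshots :+ newSnapshot)
    if (catalog.swap(table, base, updated)) return true
    attempts += 1 // lost the race; reload the metadata and try again
  }
  false
}
```

This resolves who wins a race, but as the reply below points out, the loser still has to decide whether its change can be re-applied on top of the winner's snapshot.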
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> MVCC (which is what Iceberg tries to implement) requires a total ordering of snapshots. Also, the snapshots need to be non-conflicting. I really don't see how any metadata data structures can solve this without an outside coordinator.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Consider this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Snapshot 0: (K,A) = 1
>>>>>>>>>>>>>>>>>>>>> Job X: UPDATE K SET A=A+1
>>>>>>>>>>>>>>>>>>>>> Job Y: UPDATE K SET A=10
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> What should the final value of A be, and who decides?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> By single writer, I don't mean single process, I mean multiple coordinated processes like Spark executors coordinated by the Spark driver. The coordinator ensures that the data is pre-partitioned on each executor, and the coordinator commits the snapshot.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Note however that a single writer job with multiple concurrent reader jobs is perfectly feasible, i.e. it shouldn't be a problem to write from a Spark job and read from multiple Dremio queries concurrently (for example).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> :D This is still "single process" from my perspective. That process may be coordinating other processes to do distributed work, but ultimately it is a single process.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Fair enough
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness we shouldn't assume it.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I disagree. We can specify that as a requirement and state that you'll get unintended consequences if you provide your own keys and don't maintain this.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There's no need for unintended consequences; we can specify consistent behaviour (and I believe the document says what that is).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We do expect that most of the time the natural key is unique, but the eager and lazy-with-natural-key designs can handle duplicates consistently. Basically it's not a problem to have duplicate natural keys; everything works fine.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> That heavily depends on how things are implemented. For example, we may write a bunch of code that generates internal data structures based on this expectation. If we have to support duplicate matches, all of a sudden we can no longer size various data structures to improve performance and may be unable to preallocate memory associated with a guaranteed completion.
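To make the Job X / Job Y example above concrete, here it is worked through under an assumed "loser re-applies on top of the winner's snapshot" model (an assumption for illustration, not something either side has specified). Both orders are serializable; the open question is which component picks the order.

```scala
object UpdateOrdering {
  def applyX(a: Int): Int = a + 1   // Job X: UPDATE K SET A = A + 1
  def applyY(a: Int): Int = 10      // Job Y: UPDATE K SET A = 10

  val base   = 1                     // Snapshot 0: (K, A) = 1
  val xThenY = applyY(applyX(base))  // X commits first, Y re-applies on top: A = 10
  val yThenX = applyX(applyY(base))  // Y commits first, X re-applies on top: A = 11
}
```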
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Again, we need to operate on the assumption that this is a large-scale distributed compute/remote storage scenario. Key matching is done with shuffles that move data across the network, so such optimizations would really have little impact on overall performance. Not to mention that most query engines already optimize the shuffle as much as it can be optimized.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is true that actual duplicate keys would make the key-matching join (anti-join) somewhat more expensive; however, it can be done in such a way that if the keys are in practice unique, the join is as efficient as it can be.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Let me try and clarify each point:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> - Lookup for query or update on a non-(partition/bucket/sort) key predicate implies scanning large amounts of data, because these are the only data structures that can narrow down the lookup, right? One could argue that the min/max index (file skipping) can be applied to any column, but in reality if that column is not sorted the min/max intervals can have huge overlaps, so it may be next to useless.
>>>>>>>>>>>>>>>>>>>>>>> - Remote storage: this is a critical architecture decision. Implementations on local storage imply a vastly different design for the entire system, storage and compute.
>>>>>>>>>>>>>>>>>>>>>>> - Deleting single records per snapshot is unfeasible in the eager design but particularly so in the lazy design: each deletion creates a very small snapshot. Deleting 1 million records one at a time would create 1 million small files, and 1 million RPC calls.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Why is this unfeasible? If I have a dataset of 100mm files including 1mm small files, is that a major problem? It seems like your use case isn't one where you want to support single-record deletes, but it is definitely something important to many people.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 100mm total files or 1mm files per dataset is definitely a problem on HDFS, and I believe on S3 too. Single-key deletes would work just fine, but it's simply not optimal to do that on remote storage. This is a very well known problem with HDFS, and one of the very reasons to have something like Iceberg in the first place.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Basically, users would be able to do single-key mutations, but it's not the use case we should be optimizing for, and it's really not advisable.
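The sorted-key matching described a few paragraphs up (an anti-join that stays as cheap as possible when keys happen to be unique, and remains correct when they are not) can be sketched as a single merge pass over key-sorted inputs. The function and type parameters here are illustrative only.

```scala
// Sketch: apply a key-sorted stream of delete keys to a key-sorted stream of rows.
// When keys are unique this is a plain merge; duplicate keys on either side just
// advance the iterators a little further, and every matching row is dropped.
def applySortedDeletes[K, R](rows: Iterator[(K, R)], deleteKeys: Iterator[K])
                            (implicit ord: Ordering[K]): Iterator[R] = {
  val dels = deleteKeys.buffered
  rows.filter { case (key, _) =>
    // discard delete keys that sort strictly before the current row's key
    while (dels.hasNext && ord.lt(dels.head, key)) dels.next()
    // keep the row only if no pending delete key equals its key
    !(dels.hasNext && ord.equiv(dels.head, key))
  }.map { case (_, row) => row }
}
```

With consistent partition specs and sort order from one version to the next, only co-located delete and data files need to be merged this way.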
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Eager is conceptually just lazy + compaction done, well, eagerly. The logic for both is exactly the same; the trade-off is just that with eager you implicitly compact every time so that you don't do any work on read, while with lazy you want to amortize the cost of compaction over multiple snapshots.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Basically there should be no difference between the two conceptually, or with regard to keys, etc. The only difference is some mechanics in implementation.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I think you have deconstructed the problem too much to say these are the same (or at least that is what I'm starting to think given this thread). It seems like real-world implementation decisions (per our discussion here) are in conflict. For example, you just argued against having 1mm arbitrary mutations, but I think that is because you aren't thinking about things over time with a delta implementation. Having 10,000 mutations a day where we do delta compaction once a week and local file mappings (key-to-offset sparse bitmaps) seems like it could result in very good performance in a case where we're mutating small amounts of data. In this scenario, you may never do major compaction unless you get to a high enough percentage of records that have been deleted in the original dataset. That drives a very different set of implementation decisions from a situation where you're trying to restate an entire partition at once.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We operate on 1 billion mutations per day at least. This is the problem Iceberg wants to solve; I believe it's stated upfront. 10,000/day is not a big data problem. It can be done fairly trivially and it would be supported, but there's not much point in extra-optimizing for this use case, I believe.
>>>>>>>>>>>>>>>>>>>>>
>>> -- Ryan Blue, Software Engineer, Netflix
>>
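A tiny sketch of the kind of compaction policy being debated above: merge small delta files on a schedule, and rewrite the base data (major compaction) only once a large enough fraction of it has been deleted. The thresholds and names are made up for illustration.

```scala
// Illustrative only; thresholds are arbitrary examples, not recommendations.
final case class PartitionStats(baseRowCount: Long, deletedRowCount: Long, deltaFileCount: Int)

def needsMajorCompaction(stats: PartitionStats, deletedFraction: Double = 0.10): Boolean =
  stats.baseRowCount > 0 &&
    stats.deletedRowCount.toDouble / stats.baseRowCount >= deletedFraction

def needsDeltaCompaction(stats: PartitionStats, maxDeltaFiles: Int = 100): Boolean =
  stats.deltaFileCount >= maxDeltaFiles
```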