Hi Ryan, I couldn't attend the meeting. Just curious if this was recorded, by any chance.
Regards,
Venkata krishnan

On Fri, May 24, 2019 at 8:49 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Yes, I agree. I'll talk a little about a couple of the constraints of this as well.
>
> On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
>> The agenda looks good to me. I think it would also make sense to clarify the responsibilities of query engines and Iceberg, not only in terms of uniqueness but also in terms of applying diffs on read, for example.
>>
>> On 23 May 2019, at 01:59, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>
>> Here's a rough agenda:
>>
>> - Use cases: everyone, come with a use case that you'd like to have supported. We'll go around and introduce ourselves and our use cases.
>> - Main topic: How should Iceberg identify rows that are deleted?
>> - Side topics from my initial email, if we have time: should we use insert diffs, should we support dense and sparse formats, etc.
>>
>> The main topic I think we should discuss is: *How should Iceberg identify rows that are deleted?*
>>
>> I'm phrasing it this way to avoid the places where I think we're talking past one another because we are making assumptions. The important thing is that there are two main options:
>>
>> - Filename and position, vs.
>> - Specific values of (a few) columns in the data
>>
>> This phrasing also avoids discussing uniqueness constraints. Once we get down to behavior, I think we agree. For example, I think we all agree that uniqueness cannot be enforced in Iceberg.
>>
>> If uniqueness can't be enforced in Iceberg, the main choice comes down to how we identify rows that are deleted. If we use (filename, position), then we know that there is only one row. On the other hand, if we use data values to identify rows, then a delete may identify more than one row because there are no uniqueness guarantees. I think we also agree that if more than one row is identified, all of them should be deleted.
>>
>> At that point, there are trade-offs between the approaches:
>>
>> - When identifying deleted rows by data values, situations like the one that Anton pointed out are possible.
>> - Jacques also had a good point about concurrency. If at all possible, we want to be able to reconcile changes between concurrent commits without re-running an operation.
>>
>> Sound like a reasonable amount to talk through?
>>
>> rb
>>
>> On Wed, May 22, 2019 at 1:17 PM Erik Wright <erik.wri...@shopify.com> wrote:
>>
>>> On Wed, May 22, 2019 at 4:04 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>
>>>> Agreed with Erik here; we're certainly not looking to build the equivalent of a relational database, and for that matter not even that of a local-disk-storage analytics database (like Vertica). Those are very different designs with very different trade-offs and optimizations.
>>>>
>>>> We're looking to automate and optimize specific types of file manipulation for large files on remote storage, while presenting that to the user under the common SQL API for *bulk* data manipulation (MERGE INTO).
>>>
>>> What I would encourage is to decouple the storage model from the implementation of that API. If Iceberg has support for merge-on-read of upserts and deletes, in addition to its powerful support for partitioning, it will be easy for a higher-level application to implement those APIs given certain other constraints (that might not be appropriate to all applications).
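To make the two options above concrete, here is a minimal, purely illustrative sketch of how a merge-on-read scan might apply each kind of delete marker. It is not Iceberg code; DataRow, position_deletes, and equality_deletes are made-up names.

    # Illustrative only (not Iceberg's API): applying the two kinds of delete
    # markers discussed above during a merge-on-read scan.
    from dataclasses import dataclass

    @dataclass
    class DataRow:
        file: str     # data file the row came from
        pos: int      # ordinal position of the row within that file
        values: dict  # column name -> value

    def scan(rows, position_deletes, equality_deletes):
        """Yield rows that survive both kinds of deletes.

        position_deletes: set of (file, pos) pairs; each identifies exactly one row.
        equality_deletes: list of {column: value} predicates; each may match many
                          rows, and every matching row is considered deleted.
        """
        for row in rows:
            if (row.file, row.pos) in position_deletes:
                continue
            if any(all(row.values.get(c) == v for c, v in pred.items())
                   for pred in equality_deletes):
                continue
            yield row

    rows = [
        DataRow("f1.parquet", 0, {"nk": 1, "col1": 1, "col2": 1}),
        DataRow("f1.parquet", 1, {"nk": 1, "col1": 2, "col2": 2}),
    ]
    # (filename, position) identifies a single physical row:
    print(list(scan(rows, {("f1.parquet", 0)}, [])))  # keeps only position 1
    # a value-based delete may identify several rows; all of them are removed:
    print(list(scan(rows, set(), [{"nk": 1}])))       # removes both rows

The asymmetry the thread keeps coming back to is visible in the last two lines: a (filename, position) delete can only ever remove one row, while a value-based delete removes every row it matches.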
>>>> Miguel and I are out on Friday, but Anton should be able to handle the discussion on our side.
>>>>
>>>> Thanks,
>>>> Cristian
>>>>
>>>> On 22 May 2019, at 17:51, Erik Wright <erik.wri...@shopify.com.INVALID> wrote:
>>>>
>>>>> We have two rows with the same natural key, and we use that natural key in diff files:
>>>>>
>>>>> nk | col1 | col2
>>>>>  1 |    1 |    1
>>>>>  1 |    2 |    2
>>>>>
>>>>> Then we have a delete statement:
>>>>> DELETE FROM t WHERE col1 = 1
>>>>
>>>> I think this example cuts to the point of the differences in understanding. Does Iceberg want to approach the utility of a relational database, against which I can execute complex update queries? This is not what I would have imagined.
>>>>
>>>> I would have, instead, imagined that it was up to the client to identify, through whatever means, that they want to update or delete a row with a given ID. If there are multiple (distinct) rows with the same ID, _too bad_. Any user should _expect_ that they could potentially see any one or more of those rows at read time, and that an upsert/delete would affect any/all of them (I would argue for all).
>>>>
>>>> *In summary:* Instead of trying to come up with a consistent, logical handling for complex queries that are best suited to a relational database, leave such handling up to the client and concentrate on problems that can be solved simply and more generally.
>>>>
>>>> On Wed, May 22, 2019 at 12:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>>> Yes, I think we should. I was going to propose one after catching up on the rest of this thread today.
>>>>>
>>>>> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>
>>>>>> Thanks! Would it make sense to discuss the agenda in advance?
>>>>>>
>>>>>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>>>
>>>>>> I sent out an invite and included everyone on this thread. If anyone else would like to join, please join the Zoom meeting. If you'd like to be added to the calendar invite, just let me know and I'll add you.
>>>>>>
>>>>>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>
>>>>>>> Works for me.
>>>>>>>
>>>>>>> To make things easier, we can use my Zoom meeting if people like:
>>>>>>>
>>>>>>> Join Zoom Meeting
>>>>>>> https://zoom.us/j/4157302092
>>>>>>>
>>>>>>> One tap mobile
>>>>>>> +16465588656,,4157302092# US (New York)
>>>>>>> +16699006833,,4157302092# US (San Jose)
>>>>>>>
>>>>>>> Dial by your location
>>>>>>> +1 646 558 8656 US (New York)
>>>>>>> +1 669 900 6833 US (San Jose)
>>>>>>> 877 853 5257 US Toll-free
>>>>>>> 888 475 4499 US Toll-free
>>>>>>> Meeting ID: 415 730 2092
>>>>>>> Find your local number: https://zoom.us/u/aH9XYBfm
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> 9 AM on Friday works best for me. How about then?
>>>>>>>>
>>>>>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>>>
>>>>>>>>> What about this Friday? A one-hour slot from 9:00 to 10:00 am or 10:00 to 11:00 am PST? Some folks are based in London, so meeting later than this is hard. If Friday doesn't work, we can consider Tuesday or Wednesday next week.
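Returning to the natural-key example quoted earlier in this thread: a hypothetical sketch (all names invented) of why translating a predicate delete into a natural-key diff removes every row that shares the key, which is the behaviour Erik argues clients should simply expect.

    # Erik's quoted example: two rows share natural key nk = 1, and the user runs
    # DELETE FROM t WHERE col1 = 1. If the diff file can only record natural keys,
    # the engine has to translate the predicate into a key-based delete, and the
    # key then matches *both* rows.
    table = [
        {"nk": 1, "col1": 1, "col2": 1},
        {"nk": 1, "col1": 2, "col2": 2},
    ]

    # Step 1: evaluate the user's predicate to find the rows to delete.
    matched = [r for r in table if r["col1"] == 1]       # only the first row

    # Step 2: record the delete in the diff file as the natural keys of those rows.
    deleted_keys = {r["nk"] for r in matched}            # {1}

    # Step 3: on read, drop every row whose key appears in the diff.
    visible = [r for r in table if r["nk"] not in deleted_keys]

    print(visible)  # [] -- both rows disappear, although the predicate matched one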
>>>>>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>
>>>>>>>>> I agree with Anton that we should probably spend some time on hangouts further discussing things. There are definitely differing expectations here, and we seem to be talking a bit past each other.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I love a good flame war :P
>>>>>>>>>>
>>>>>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> That's my point: truly independent writers (two Spark jobs, or a Spark job and a Dremio job) mean a distributed transaction. It would need yet another external transaction coordinator on top of both Spark and Dremio; Iceberg by itself cannot solve this.
>>>>>>>>>>
>>>>>>>>>> I'm not ready to accept this. Iceberg already supports a set of semantics around multiple writers committing simultaneously and how conflict resolution is done. The same can be done here.
>>>>>>>>>>
>>>>>>>>>> MVCC (which is what Iceberg tries to implement) requires a total ordering of snapshots. The snapshots also need to be non-conflicting. I really don't see how any metadata structures can solve this without an outside coordinator.
>>>>>>>>>>
>>>>>>>>>> Consider this:
>>>>>>>>>>
>>>>>>>>>> Snapshot 0: (K, A) = 1
>>>>>>>>>> Job X: UPDATE K SET A = A + 1
>>>>>>>>>> Job Y: UPDATE K SET A = 10
>>>>>>>>>>
>>>>>>>>>> What should the final value of A be, and who decides?
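As an aside on the (K, A) example just above, here is a toy sketch of the ordering problem with two uncoordinated writers. This is not Iceberg's actual commit protocol; the Table class and its compare-and-swap commit are invented purely for illustration.

    # Toy model: both jobs read snapshot 0, so whichever commits second is working
    # from a stale base and must be rejected, retried, or reconciled.
    class Table:
        def __init__(self, value):
            self.snapshot_id = 0
            self.value = value  # the single cell (K, A) from the example

        def commit(self, base_snapshot_id, new_value):
            # Compare-and-swap: only accept a commit computed against the current snapshot.
            if base_snapshot_id != self.snapshot_id:
                raise RuntimeError("conflict: table changed since this job read it")
            self.snapshot_id += 1
            self.value = new_value

    t = Table(value=1)                      # Snapshot 0: (K, A) = 1

    base_x, a_x = t.snapshot_id, t.value    # Job X reads: UPDATE K SET A = A + 1
    base_y = t.snapshot_id                  # Job Y reads: UPDATE K SET A = 10

    t.commit(base_x, a_x + 1)               # X commits first: snapshot 1, A = 2
    try:
        t.commit(base_y, 10)                # Y's base snapshot is now stale
    except RuntimeError as e:
        print(e)                            # Y must be rejected, retried, or reconciled

    print(t.snapshot_id, t.value)           # 1 2 -- the final value depends on the retry policy

Whichever of X and Y commits second sees a stale base snapshot; deciding whether it is rejected, blindly retried, or reconciled against the winner's result is exactly the decision Cristian says needs a coordinator and Ryan and Jacques hope to handle with conflict-resolution rules.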
>>>>>>>>>>> By single writer, I don't mean a single process; I mean multiple coordinated processes, like Spark executors coordinated by the Spark driver. The coordinator ensures that the data is pre-partitioned on each executor, and the coordinator commits the snapshot.
>>>>>>>>>>>
>>>>>>>>>>> Note, however, that a single writer job with multiple concurrent reader jobs is perfectly feasible, i.e. it shouldn't be a problem to write from a Spark job and read from multiple Dremio queries concurrently (for example).
>>>>>>>>>>
>>>>>>>>>> :D This is still "single process" from my perspective. That process may be coordinating other processes to do distributed work, but ultimately it is a single process.
>>>>>>>>>>
>>>>>>>>>> Fair enough.
>>>>>>>>>>
>>>>>>>>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness, we shouldn't assume it.
>>>>>>>>>>
>>>>>>>>>> I disagree. We can specify that as a requirement and state that you'll get unintended consequences if you provide your own keys and don't maintain this.
>>>>>>>>>>
>>>>>>>>>> There's no need for unintended consequences; we can specify consistent behaviour (and I believe the document says what that is).
>>>>>>>>>>
>>>>>>>>>>> We do expect that most of the time the natural key is unique, but the eager and lazy designs with natural keys can handle duplicates consistently. Basically, it's not a problem to have duplicate natural keys; everything works fine.
>>>>>>>>>>
>>>>>>>>>> That heavily depends on how things are implemented. For example, we may write a bunch of code that generates internal data structures based on this expectation. If we have to support duplicate matches, all of a sudden we can no longer size various data structures to improve performance, and we may be unable to preallocate memory associated with a guaranteed completion.
>>>>>>>>>>
>>>>>>>>>> Again, we need to operate on the assumption that this is a large-scale distributed compute/remote storage scenario. Key matching is done with shuffles, with data movement across the network; such optimizations would have little impact on overall performance. Not to mention that most query engines already optimize the shuffle as much as it can be optimized.
>>>>>>>>>>
>>>>>>>>>> It is true that actual duplicate keys would make the key-matching join (anti-join) somewhat more expensive; however, it can be done in such a way that, if the keys are in practice unique, the join is as efficient as it can be.
>>>>>>>>>>
>>>>>>>>>>> Let me try and clarify each point:
>>>>>>>>>>>
>>>>>>>>>>> - lookup for query or update on a non-(partition/bucket/sort) key predicate implies scanning large amounts of data, because those are the only data structures that can narrow down the lookup, right? One could argue that the min/max index (file skipping) can be applied to any column, but in reality, if that column is not sorted, the min/max intervals can have huge overlaps, so it may be next to useless.
>>>>>>>>>>> - remote storage - this is a critical architecture decision: implementations on local storage imply a vastly different design for the entire system, storage and compute.
>>>>>>>>>>> - deleting single records per snapshot is unfeasible in the eager design, but particularly so in the lazy design: each deletion creates a very small snapshot. Deleting 1 million records one at a time would create 1 million small files and 1 million RPC calls.
>>>>>>>>>>
>>>>>>>>>> Why is this unfeasible? If I have a dataset of 100mm files including 1mm small files, is that a major problem? It seems like your use case isn't one where you want to support single-record deletes, but it is definitely something important to many people.
>>>>>>>>>>
>>>>>>>>>> 100mm total files, or 1mm files per dataset, is definitely a problem on HDFS, and I believe on S3 too. Single-key deletes would work just fine, but it's simply not optimal to do that on remote storage. This is a very well known problem with HDFS, and one of the very reasons to have something like Iceberg in the first place.
>>>>>>>>>>
>>>>>>>>>> Basically, users would be able to do single-key mutations, but it's not the use case we should be optimizing for, and it's really not advisable.
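As an aside on the min/max (file-skipping) point in Cristian's list above, here is a small sketch with made-up per-file statistics showing why column ranges prune well only when the column is sorted or clustered across files.

    # Hypothetical per-file (min, max) stats for one column; nothing Iceberg-specific.
    def files_to_scan(file_stats, lo, hi):
        """Keep only files whose [min, max] range can contain values in [lo, hi]."""
        return [f for f, (fmin, fmax) in file_stats.items()
                if fmax >= lo and fmin <= hi]

    # Column sorted across files: ranges are disjoint, the predicate hits one file.
    sorted_stats = {"f1": (0, 99), "f2": (100, 199), "f3": (200, 299)}
    print(files_to_scan(sorted_stats, 150, 160))    # ['f2']

    # Same values written unsorted: every file's range overlaps the predicate,
    # so nothing is skipped and the lookup degenerates into a full scan.
    unsorted_stats = {"f1": (0, 295), "f2": (3, 299), "f3": (1, 290)}
    print(files_to_scan(unsorted_stats, 150, 160))  # ['f1', 'f2', 'f3']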
>>>>>>>>>>> Eager is conceptually just lazy plus compaction done, well, eagerly. The logic for both is exactly the same; the trade-off is just that with eager you implicitly compact every time, so that you don't do any work on read, while with lazy you want to amortize the cost of compaction over multiple snapshots.
>>>>>>>>>>>
>>>>>>>>>>> Basically, there should be no difference between the two conceptually, or with regard to keys, etc. The only difference is some mechanics in implementation.
>>>>>>>>>>
>>>>>>>>>> I think you have deconstructed the problem too much to say these are the same (or at least that is what I'm starting to think given this thread). It seems like real-world implementation decisions (per our discussion here) are in conflict. For example, you just argued against having 1mm arbitrary mutations, but I think that is because you aren't thinking about things over time with a delta implementation. Having 10,000 mutations a day, where we do delta compaction once a week and keep local file mappings (key-to-offset sparse bitmaps), seems like it could result in very good performance in a case where we're mutating small amounts of data. In this scenario, you may never do major compaction unless you get to a high enough percentage of records that have been deleted in the original dataset. That drives a very different set of implementation decisions from a situation where you're trying to restate an entire partition at once.
>>>>>>>>>>
>>>>>>>>>> We operate on 1 billion mutations per day at least. This is the problem Iceberg wants to solve; I believe it's stated upfront. 10,000/day is not a big data problem. It can be done fairly trivially and it would be supported, but there's not much point in specifically optimizing for that use case, I believe.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
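Finally, a rough sketch of the delta-compaction trade-off Jacques describes above. The names and the threshold are invented; it only illustrates keeping small per-file delete sets (e.g. position bitmaps) and rewriting the base data only once enough of it has been deleted.

    def merge_deltas(daily_deltas):
        """Minor compaction: fold many small delete sets into one set per data file."""
        merged = set()
        for delta in daily_deltas:
            merged |= delta
        return merged

    def maybe_major_compact(rows, deleted_positions, threshold=0.3):
        """Rewrite the data file only when the deleted fraction crosses a threshold."""
        if len(deleted_positions) / len(rows) < threshold:
            return rows, deleted_positions        # keep reading with merge-on-read
        live = [r for i, r in enumerate(rows) if i not in deleted_positions]
        return live, set()                        # rewritten file, no deltas left

    rows = list(range(100))                        # stand-in for a 100-row data file
    week_of_deltas = [{1, 2}, {10}, {42, 43}]      # a week of small daily delete sets
    deleted = merge_deltas(week_of_deltas)

    rows, deleted = maybe_major_compact(rows, deleted)
    print(len(rows), len(deleted))                 # 100 5 -> not worth rewriting yet

Whether that amortization is the case worth optimizing for is where the two sides above differ.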