Yes, I think we should. I was going to propose one after catching up on the rest of this thread today.
On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:

> Thanks! Would it make sense to discuss the agenda in advance?
>
> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> I sent out an invite and included everyone on this thread. If anyone else
> would like to join, please join the Zoom meeting. If you'd like to be added
> to the calendar invite, just let me know and I'll add you.
>
> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> Works for me.
>>
>> To make things easier, we can use my Zoom meeting if people like:
>>
>> Join Zoom Meeting
>> https://zoom.us/j/4157302092
>>
>> One tap mobile
>> +16465588656,,4157302092# US (New York)
>> +16699006833,,4157302092# US (San Jose)
>>
>> Dial by your location
>> +1 646 558 8656 US (New York)
>> +1 669 900 6833 US (San Jose)
>> 877 853 5257 US Toll-free
>> 888 475 4499 US Toll-free
>> Meeting ID: 415 730 2092
>> Find your local number: https://zoom.us/u/aH9XYBfm
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> 9 AM on Friday works best for me. How about then?
>>>
>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>
>>>> What about this Friday? A one-hour slot, either 9:00 to 10:00 am or
>>>> 10:00 to 11:00 am PST? Some folks are based in London, so meeting later
>>>> than this is hard. If Friday doesn't work, we can consider Tuesday or
>>>> Wednesday next week.
>>>>
>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>
>>>> I agree with Anton that we should probably spend some time on hangouts
>>>> further discussing things. There are definitely differing expectations
>>>> here, and we seem to be talking a bit past each other.
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>>
>>>>> I love a good flame war :P
>>>>>
>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>
>>>>>> That's my point: truly independent writers (two Spark jobs, or a Spark
>>>>>> job and a Dremio job) mean a distributed transaction. It would need yet
>>>>>> another external transaction coordinator on top of both Spark and
>>>>>> Dremio; Iceberg by itself cannot solve this.
>>>>>
>>>>> I'm not ready to accept this. Iceberg already supports a set of
>>>>> semantics around multiple writers committing simultaneously and how
>>>>> conflict resolution is done. The same can be done here.
>>>>>
>>>>> MVCC (which is what Iceberg tries to implement) requires a total
>>>>> ordering of snapshots. The snapshots also need to be non-conflicting. I
>>>>> really don't see how any metadata structures can solve this without an
>>>>> outside coordinator.
>>>>>
>>>>> Consider this:
>>>>>
>>>>> Snapshot 0: (K,A) = 1
>>>>> Job X: UPDATE K SET A=A+1
>>>>> Job Y: UPDATE K SET A=10
>>>>>
>>>>> What should the final value of A be, and who decides?
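
To make that conflict concrete, here is a minimal single-process sketch in
Java (hypothetical names, not Iceberg's actual API) of the optimistic-commit
model the thread is debating: a commit succeeds only if the table still
points at the snapshot the writer read, and the writer re-reads and re-applies
its update otherwise. The final value of A is 10 or 11 depending on which job
commits first, so commit order still "decides", but each outcome corresponds
to some serial history and neither update is silently lost.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.IntUnaryOperator;

    public class OptimisticCommitSketch {
        // The "table": the current snapshot is just the value of A for key K.
        private final AtomicInteger snapshot = new AtomicInteger(1); // Snapshot 0: A = 1

        // Commit succeeds only if no other writer committed since we read
        // the base snapshot; otherwise retry against the new base.
        int commit(IntUnaryOperator update) {
            while (true) {
                int base = snapshot.get();              // snapshot this writer read
                int next = update.applyAsInt(base);     // recompute against latest base
                if (snapshot.compareAndSet(base, next)) // fails if someone beat us
                    return next;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            OptimisticCommitSketch table = new OptimisticCommitSketch();
            Thread jobX = new Thread(() -> table.commit(a -> a + 1)); // UPDATE K SET A=A+1
            Thread jobY = new Thread(() -> table.commit(a -> 10));    // UPDATE K SET A=10
            jobX.start(); jobY.start();
            jobX.join(); jobY.join();
            // Prints A = 10 (X then Y) or A = 11 (Y then X); both are
            // serializable outcomes, and the lost-update anomaly (A = 2) is
            // impossible because X's retry re-reads the committed base.
            System.out.println("A = " + table.snapshot.get());
        }
    }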
>>>>>> By single writer, I don't mean a single process; I mean multiple
>>>>>> coordinated processes, like Spark executors coordinated by a Spark
>>>>>> driver. The coordinator ensures that the data is pre-partitioned on
>>>>>> each executor, and the coordinator commits the snapshot.
>>>>>>
>>>>>> Note however that a single writer job with multiple concurrent reader
>>>>>> jobs is perfectly feasible, i.e. it shouldn't be a problem to write
>>>>>> from a Spark job and read from multiple Dremio queries concurrently
>>>>>> (for example).
>>>>>
>>>>> :D This is still "single process" from my perspective. That process
>>>>> may be coordinating other processes to do distributed work, but
>>>>> ultimately it is a single process.
>>>>>
>>>>> Fair enough.
>>>>>
>>>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness,
>>>>>> we shouldn't assume it.
>>>>>
>>>>> I disagree. We can specify that as a requirement and state that you'll
>>>>> get unintended consequences if you provide your own keys and don't
>>>>> maintain this.
>>>>>
>>>>> There's no need for unintended consequences; we can specify consistent
>>>>> behaviour (and I believe the document says what that is).
>>>>>
>>>>>> We do expect that most of the time the natural key is unique, but the
>>>>>> eager and lazy natural-key designs can handle duplicates consistently.
>>>>>> Basically, duplicate natural keys are not a problem; everything works
>>>>>> fine.
>>>>>
>>>>> That heavily depends on how things are implemented. For example, we
>>>>> may write a bunch of code that generates internal data structures based
>>>>> on this expectation. If we have to support duplicate matches, all of a
>>>>> sudden we can no longer size various data structures to improve
>>>>> performance, and may be unable to preallocate memory associated with a
>>>>> guaranteed completion.
>>>>>
>>>>> Again, we need to operate on the assumption that this is a large-scale
>>>>> distributed compute/remote storage scenario. Key matching is done with
>>>>> shuffles that move data across the network, so such optimizations would
>>>>> have little impact on overall performance. Not to mention that most
>>>>> query engines already optimize the shuffle as much as it can be
>>>>> optimized.
>>>>>
>>>>> It is true that actual duplicate keys would make the key-matching join
>>>>> (anti-join) somewhat more expensive, but it can be done in such a way
>>>>> that if the keys are in practice unique, the join is as efficient as it
>>>>> can be.
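
For illustration, a minimal single-process sketch of that key-matching
anti-join (hypothetical names; a stand-in for the shuffle-based version a
distributed engine would actually run): every row whose natural key appears
in the delete set is dropped, so duplicate keys get consistent behaviour
rather than surprises, at no cost beyond the extra hash probes.

    import java.util.List;
    import java.util.Set;

    public class AntiJoinSketch {
        record Row(String key, int value) {}

        // Left anti-join: keep only rows whose key is NOT in deletedKeys.
        static List<Row> applyDeletes(List<Row> rows, Set<String> deletedKeys) {
            return rows.stream()
                       .filter(r -> !deletedKeys.contains(r.key()))
                       .toList();
        }

        public static void main(String[] args) {
            List<Row> data = List.of(
                new Row("k1", 1),
                new Row("k2", 2),
                new Row("k2", 3),   // duplicate natural key
                new Row("k3", 4));
            Set<String> deletes = Set.of("k2"); // delete by key, not position
            // Both k2 rows are removed; with unique keys the same code is as
            // cheap as the join can be. In a distributed engine the shuffle
            // co-locates equal keys first, then the same filter applies.
            System.out.println(applyDeletes(data, deletes)); // k1 and k3 remain
        }
    }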
>>>>>> Let me try and clarify each point:
>>>>>>
>>>>>> - Lookup for query or update on a non-(partition/bucket/sort) key
>>>>>> predicate implies scanning large amounts of data, because those are
>>>>>> the only data structures that can narrow down the lookup. One could
>>>>>> argue that the min/max index (file skipping) can be applied to any
>>>>>> column, but in reality, if that column is not sorted, the min/max
>>>>>> intervals can have huge overlaps, so it may be next to useless.
>>>>>> - Remote storage: this is a critical architecture decision.
>>>>>> Implementations on local storage imply a vastly different design for
>>>>>> the entire system, storage and compute.
>>>>>> - Deleting single records per snapshot is unfeasible in the eager
>>>>>> design, but particularly in the lazy one: each deletion creates a very
>>>>>> small snapshot. Deleting 1 million records one at a time would create
>>>>>> 1 million small files and 1 million RPC calls.
>>>>>
>>>>> Why is this unfeasible? If I have a dataset of 100mm files including
>>>>> 1mm small files, is that a major problem? It seems like your use case
>>>>> isn't one where you want to support single-record deletes, but it is
>>>>> definitely something important to many people.
>>>>>
>>>>> 100mm total files, or 1mm files per dataset, is definitely a problem
>>>>> on HDFS, and I believe on S3 too. Single-key deletes would work just
>>>>> fine, but it's simply not optimal to do that on remote storage. This is
>>>>> a very well-known problem with HDFS, and one of the very reasons to
>>>>> have something like Iceberg in the first place.
>>>>>
>>>>> Basically, users would be able to do single-key mutations; it's just
>>>>> not the use case we should be optimizing for, and it's really not
>>>>> advisable.
>>>>>
>>>>>> Eager is conceptually just lazy + compaction done, well, eagerly. The
>>>>>> logic for both is exactly the same; the trade-off is just that with
>>>>>> eager you implicitly compact every time, so that you do no work on
>>>>>> read, while with lazy you want to amortize the cost of compaction over
>>>>>> multiple snapshots.
>>>>>>
>>>>>> Basically there should be no difference between the two conceptually,
>>>>>> or with regard to keys, etc. The only difference is some mechanics in
>>>>>> implementation.
>>>>>
>>>>> I think you have deconstructed the problem too much to say these are
>>>>> the same (or at least that is what I'm starting to think given this
>>>>> thread). It seems like real-world implementation decisions (per our
>>>>> discussion here) are in conflict. For example, you just argued against
>>>>> having 1mm arbitrary mutations, but I think that is because you aren't
>>>>> thinking about things over time with a delta implementation. Having
>>>>> 10,000 mutations a day, where we do delta compaction once a week and
>>>>> local file mappings (key-to-offset sparse bitmaps), seems like it could
>>>>> result in very good performance in a case where we're mutating small
>>>>> amounts of data. In this scenario, you may never do major compaction
>>>>> unless you get to a high enough percentage of records that have been
>>>>> deleted in the original dataset. That drives a very different set of
>>>>> implementation decisions from a situation where you're trying to
>>>>> restate an entire partition at once.
>>>>>
>>>>> We operate on 1 billion mutations per day at least. This is the
>>>>> problem Iceberg wants to solve; I believe it's stated upfront.
>>>>> 10,000/day is not a big data problem. It can be done fairly trivially
>>>>> and it would be supported, but I believe there's not much point in
>>>>> extra optimization for this use case.
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
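
As a closing illustration of the delta approach Jacques describes above,
here is a minimal sketch (hypothetical design and threshold, not taken from
the proposal) of key-to-offset delete bitmaps: deletes are recorded as row
positions in a per-file bitmap, bitmaps from successive snapshots are OR-ed
together cheaply (delta compaction), and the data file itself is rewritten
(major compaction) only once a large enough fraction of it is dead.

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    public class DeleteBitmapSketch {
        static final double MAJOR_COMPACTION_THRESHOLD = 0.3; // assumed tunable

        public static void main(String[] args) {
            // One "data file" of five rows, addressed by offset 0..4.
            List<String> dataFile = new ArrayList<>(List.of("a", "b", "c", "d", "e"));

            BitSet day1 = new BitSet(); day1.set(1); // snapshot 1 deletes offset 1
            BitSet day2 = new BitSet(); day2.set(3); // snapshot 2 deletes offset 3
            BitSet merged = new BitSet();
            merged.or(day1); merged.or(day2);        // delta compaction: OR bitmaps

            // Merge-on-read: readers skip deleted offsets; the file is untouched.
            for (int i = 0; i < dataFile.size(); i++)
                if (!merged.get(i)) System.out.println(dataFile.get(i)); // a c e

            // Major compaction only when the dead fraction crosses the threshold.
            if ((double) merged.cardinality() / dataFile.size() >= MAJOR_COMPACTION_THRESHOLD) {
                List<String> rewritten = new ArrayList<>();
                for (int i = 0; i < dataFile.size(); i++)
                    if (!merged.get(i)) rewritten.add(dataFile.get(i));
                dataFile = rewritten;                // new file; bitmap starts empty
            }
            System.out.println("file after compaction: " + dataFile); // [a, c, e]
        }
    }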