Hi Ryan, I couldn't attend the meeting. Just curious if this was recorded, by any chance.
Regards,
Venkata krishnan

On Fri, May 24, 2019 at 8:49 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Yes, I agree. I'll talk a little about a couple of the constraints of this as well.
>
> On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
>> The agenda looks good to me. I think it would also make sense to clarify the responsibilities of query engines and Iceberg, not only in terms of uniqueness but also in terms of applying diffs on read, for example.
>>
>> On 23 May 2019, at 01:59, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>
>> Here's a rough agenda:
>>
>> - Use cases: everyone, come with a use case that you'd like to have supported. We'll go around and introduce ourselves and our use cases.
>> - Main topic: How should Iceberg identify rows that are deleted?
>> - Side topics from my initial email, if we have time: should we use insert diffs, should we support dense and sparse formats, etc.
>>
>> The main topic I think we should discuss is: *How should Iceberg identify rows that are deleted?*
>>
>> I'm phrasing it this way to avoid the places where I think we're talking past one another because we are making assumptions. The important thing is that there are two main options:
>>
>> - Filename and position, vs.
>> - Specific values of (a few) columns in the data
>>
>> This phrasing also avoids discussing uniqueness constraints. Once we get down to behavior, I think we agree. For example, I think we all agree that uniqueness cannot be enforced in Iceberg.
>>
>> If uniqueness can't be enforced in Iceberg, the main choice comes down to how we identify rows that are deleted. If we use (filename, position), then we know that there is only one row. On the other hand, if we use data values to identify rows, then a delete may identify more than one row because there are no uniqueness guarantees. I think we also agree that if more than one row is identified, all of them should be deleted.
>>
>> At that point, there are trade-offs between the approaches:
>>
>> - When identifying deleted rows by data values, situations like the one that Anton pointed out are possible.
>> - Jacques also had a good point about concurrency. If at all possible, we want to be able to reconcile changes between concurrent commits without re-running an operation.
>>
>> Sound like a reasonable amount to talk through?
>>
>> rb
>>
>> On Wed, May 22, 2019 at 1:17 PM Erik Wright <erik.wri...@shopify.com> wrote:
>>
>>> On Wed, May 22, 2019 at 4:04 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>
>>>> Agreed with Erik here; we're certainly not looking to build the equivalent of a relational database, and for that matter not even that of a local-disk-storage analytics database (like Vertica). Those are very different designs with very different trade-offs and optimizations.
>>>>
>>>> We're looking to automate and optimize specific types of file manipulation for large files on remote storage, while presenting that to the user under the common SQL API for *bulk* data manipulation (MERGE INTO).
>>>
>>> What I would encourage is to decouple the storage model from the implementation of that API. If Iceberg has support for merge-on-read of upserts and deletes, in addition to its powerful support for partitioning, it will be easy for a higher-level application to implement those APIs given certain other constraints (that might not be appropriate to all applications).
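To make the two options above concrete, here is a minimal, purely illustrative sketch of how a merge-on-read scan might apply each kind of delete marker. It is not Iceberg code; DataRow, position_deletes, and equality_deletes are made-up names.

    # Illustrative only (not Iceberg's API): applying the two kinds of delete
    # markers discussed above during a merge-on-read scan.
    from dataclasses import dataclass

    @dataclass
    class DataRow:
        file: str     # data file the row came from
        pos: int      # ordinal position of the row within that file
        values: dict  # column name -> value

    def scan(rows, position_deletes, equality_deletes):
        """Yield rows that survive both kinds of deletes.

        position_deletes: set of (file, pos) pairs; each identifies exactly one row.
        equality_deletes: list of {column: value} predicates; each may match many
                          rows, and every matching row is considered deleted.
        """
        for row in rows:
            if (row.file, row.pos) in position_deletes:
                continue
            if any(all(row.values.get(c) == v for c, v in pred.items())
                   for pred in equality_deletes):
                continue
            yield row

    rows = [
        DataRow("f1.parquet", 0, {"nk": 1, "col1": 1, "col2": 1}),
        DataRow("f1.parquet", 1, {"nk": 1, "col1": 2, "col2": 2}),
    ]
    # (filename, position) identifies a single physical row:
    print(list(scan(rows, {("f1.parquet", 0)}, [])))  # keeps only position 1
    # a value-based delete may identify several rows; all of them are removed:
    print(list(scan(rows, set(), [{"nk": 1}])))       # removes both rows

The asymmetry the thread keeps coming back to is visible in the last two lines: a (filename, position) delete can only ever remove one row, while a value-based delete removes every row it matches.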
>>>> Miguel and I are out on Friday, but Anton should be able to handle the discussion on our side.
>>>>
>>>> Thanks,
>>>> Cristian
>>>>
>>>> On 22 May 2019, at 17:51, Erik Wright <erik.wri...@shopify.com.INVALID> wrote:
>>>>
>>>>> We have two rows with the same natural key, and we use that natural key in diff files:
>>>>>
>>>>> nk | col1 | col2
>>>>>  1 |    1 |    1
>>>>>  1 |    2 |    2
>>>>>
>>>>> Then we have a delete statement:
>>>>> DELETE FROM t WHERE col1 = 1
>>>>
>>>> I think this example cuts to the point of the differences in understanding. Does Iceberg want to approach the utility of a relational database, against which I can execute complex update queries? This is not what I would have imagined.
>>>>
>>>> I would have, instead, imagined that it was up to the client to identify, through whatever means, that they want to update or delete a row with a given ID. If there are multiple (distinct) rows with the same ID, _too bad_. Any user should _expect_ that they could potentially see any one or more of those rows at read time, and that an upsert/delete would affect any/all of them (I would argue for all).
>>>>
>>>> *In summary:* Instead of trying to come up with a consistent, logical handling for complex queries that are best suited to a relational database, leave such handling up to the client and concentrate on problems that can be solved simply and more generally.
>>>>
>>>> On Wed, May 22, 2019 at 12:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>>> Yes, I think we should. I was going to propose one after catching up on the rest of this thread today.
>>>>>
>>>>> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>
>>>>>> Thanks! Would it make sense to discuss the agenda in advance?
>>>>>>
>>>>>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>>>
>>>>>> I sent out an invite and included everyone on this thread. If anyone else would like to join, please join the Zoom meeting. If you'd like to be added to the calendar invite, just let me know and I'll add you.
>>>>>>
>>>>>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>
>>>>>>> Works for me.
>>>>>>>
>>>>>>> To make things easier, we can use my Zoom meeting if people like:
>>>>>>>
>>>>>>> Join Zoom Meeting
>>>>>>> https://zoom.us/j/4157302092
>>>>>>>
>>>>>>> One tap mobile
>>>>>>> +16465588656,,4157302092# US (New York)
>>>>>>> +16699006833,,4157302092# US (San Jose)
>>>>>>>
>>>>>>> Dial by your location
>>>>>>> +1 646 558 8656 US (New York)
>>>>>>> +1 669 900 6833 US (San Jose)
>>>>>>> 877 853 5257 US Toll-free
>>>>>>> 888 475 4499 US Toll-free
>>>>>>> Meeting ID: 415 730 2092
>>>>>>> Find your local number: https://zoom.us/u/aH9XYBfm
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> 9 AM on Friday works best for me. How about then?
>>>>>>>>
>>>>>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>>>
>>>>>>>>> What about this Friday? A one-hour slot from 9:00 to 10:00 am or 10:00 to 11:00 am PST? Some folks are based in London, so meeting later than this is hard. If Friday doesn't work, we can consider Tuesday or Wednesday next week.
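Returning to the natural-key example quoted earlier in this thread: a hypothetical sketch (all names invented) of why translating a predicate delete into a natural-key diff removes every row that shares the key, which is the behaviour Erik argues clients should simply expect.

    # Erik's quoted example: two rows share natural key nk = 1, and the user runs
    # DELETE FROM t WHERE col1 = 1. If the diff file can only record natural keys,
    # the engine has to translate the predicate into a key-based delete, and the
    # key then matches *both* rows.
    table = [
        {"nk": 1, "col1": 1, "col2": 1},
        {"nk": 1, "col1": 2, "col2": 2},
    ]

    # Step 1: evaluate the user's predicate to find the rows to delete.
    matched = [r for r in table if r["col1"] == 1]       # only the first row

    # Step 2: record the delete in the diff file as the natural keys of those rows.
    deleted_keys = {r["nk"] for r in matched}            # {1}

    # Step 3: on read, drop every row whose key appears in the diff.
    visible = [r for r in table if r["nk"] not in deleted_keys]

    print(visible)  # [] -- both rows disappear, although the predicate matched one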
>>>>>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>
>>>>>>>>> I agree with Anton that we should probably spend some time on hangouts further discussing things. There are definitely differing expectations here, and we seem to be talking a bit past each other.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I love a good flame war :P
>>>>>>>>>>
>>>>>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> That's my point: truly independent writers (two Spark jobs, or a Spark job and a Dremio job) mean a distributed transaction. It would need yet another external transaction coordinator on top of both Spark and Dremio; Iceberg by itself cannot solve this.
>>>>>>>>>>
>>>>>>>>>> I'm not ready to accept this. Iceberg already supports a set of semantics around multiple writers committing simultaneously and how conflict resolution is done. The same can be done here.
>>>>>>>>>>
>>>>>>>>>> MVCC (which is what Iceberg tries to implement) requires a total ordering of snapshots. The snapshots also need to be non-conflicting. I really don't see how any metadata structures can solve this without an outside coordinator.
>>>>>>>>>>
>>>>>>>>>> Consider this:
>>>>>>>>>>
>>>>>>>>>> Snapshot 0: (K, A) = 1
>>>>>>>>>> Job X: UPDATE K SET A = A + 1
>>>>>>>>>> Job Y: UPDATE K SET A = 10
>>>>>>>>>>
>>>>>>>>>> What should the final value of A be, and who decides?
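As an aside on the (K, A) example just above, here is a toy sketch of the ordering problem with two uncoordinated writers. This is not Iceberg's actual commit protocol; the Table class and its compare-and-swap commit are invented purely for illustration.

    # Toy model: both jobs read snapshot 0, so whichever commits second is working
    # from a stale base and must be rejected, retried, or reconciled.
    class Table:
        def __init__(self, value):
            self.snapshot_id = 0
            self.value = value  # the single cell (K, A) from the example

        def commit(self, base_snapshot_id, new_value):
            # Compare-and-swap: only accept a commit computed against the current snapshot.
            if base_snapshot_id != self.snapshot_id:
                raise RuntimeError("conflict: table changed since this job read it")
            self.snapshot_id += 1
            self.value = new_value

    t = Table(value=1)                      # Snapshot 0: (K, A) = 1

    base_x, a_x = t.snapshot_id, t.value    # Job X reads: UPDATE K SET A = A + 1
    base_y = t.snapshot_id                  # Job Y reads: UPDATE K SET A = 10

    t.commit(base_x, a_x + 1)               # X commits first: snapshot 1, A = 2
    try:
        t.commit(base_y, 10)                # Y's base snapshot is now stale
    except RuntimeError as e:
        print(e)                            # Y must be rejected, retried, or reconciled

    print(t.snapshot_id, t.value)           # 1 2 -- the final value depends on the retry policy

Whichever of X and Y commits second sees a stale base snapshot; deciding whether it is rejected, blindly retried, or reconciled against the winner's result is exactly the decision Cristian says needs a coordinator and Ryan and Jacques hope to handle with conflict-resolution rules.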
>>>>>>>>>>> By single writer, I don't mean a single process; I mean multiple coordinated processes, like Spark executors coordinated by the Spark driver. The coordinator ensures that the data is pre-partitioned on each executor, and the coordinator commits the snapshot.
>>>>>>>>>>>
>>>>>>>>>>> Note, however, that a single writer job with multiple concurrent reader jobs is perfectly feasible, i.e. it shouldn't be a problem to write from a Spark job and read from multiple Dremio queries concurrently (for example).
>>>>>>>>>>
>>>>>>>>>> :D This is still "single process" from my perspective. That process may be coordinating other processes to do distributed work, but ultimately it is a single process.
>>>>>>>>>>
>>>>>>>>>> Fair enough.
>>>>>>>>>>
>>>>>>>>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness, we shouldn't assume it.
>>>>>>>>>>
>>>>>>>>>> I disagree. We can specify that as a requirement and state that you'll get unintended consequences if you provide your own keys and don't maintain this.
>>>>>>>>>>
>>>>>>>>>> There's no need for unintended consequences; we can specify consistent behaviour (and I believe the document says what that is).
>>>>>>>>>>
>>>>>>>>>>> We do expect that most of the time the natural key is unique, but the eager and lazy designs with natural keys can handle duplicates consistently. Basically, it's not a problem to have duplicate natural keys; everything works fine.
>>>>>>>>>>
>>>>>>>>>> That heavily depends on how things are implemented. For example, we may write a bunch of code that generates internal data structures based on this expectation. If we have to support duplicate matches, all of a sudden we can no longer size various data structures to improve performance, and we may be unable to preallocate memory associated with a guaranteed completion.
>>>>>>>>>>
>>>>>>>>>> Again, we need to operate on the assumption that this is a large-scale distributed compute/remote storage scenario. Key matching is done with shuffles, with data movement across the network; such optimizations would have little impact on overall performance. Not to mention that most query engines already optimize the shuffle as much as it can be optimized.
>>>>>>>>>>
>>>>>>>>>> It is true that actual duplicate keys would make the key-matching join (anti-join) somewhat more expensive; however, it can be done in such a way that, if the keys are in practice unique, the join is as efficient as it can be.
>>>>>>>>>>
>>>>>>>>>>> Let me try and clarify each point:
>>>>>>>>>>>
>>>>>>>>>>> - lookup for query or update on a non-(partition/bucket/sort) key predicate implies scanning large amounts of data, because those are the only data structures that can narrow down the lookup, right? One could argue that the min/max index (file skipping) can be applied to any column, but in reality, if that column is not sorted, the min/max intervals can have huge overlaps, so it may be next to useless.
>>>>>>>>>>> - remote storage - this is a critical architecture decision: implementations on local storage imply a vastly different design for the entire system, storage and compute.
>>>>>>>>>>> - deleting single records per snapshot is unfeasible in the eager design, but particularly so in the lazy design: each deletion creates a very small snapshot. Deleting 1 million records one at a time would create 1 million small files and 1 million RPC calls.
>>>>>>>>>>
>>>>>>>>>> Why is this unfeasible? If I have a dataset of 100mm files including 1mm small files, is that a major problem? It seems like your use case isn't one where you want to support single-record deletes, but it is definitely something important to many people.
>>>>>>>>>>
>>>>>>>>>> 100mm total files, or 1mm files per dataset, is definitely a problem on HDFS, and I believe on S3 too. Single-key deletes would work just fine, but it's simply not optimal to do that on remote storage. This is a very well known problem with HDFS, and one of the very reasons to have something like Iceberg in the first place.
>>>>>>>>>>
>>>>>>>>>> Basically, users would be able to do single-key mutations, but it's not the use case we should be optimizing for, and it's really not advisable.
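As an aside on the min/max (file-skipping) point in Cristian's list above, here is a small sketch with made-up per-file statistics showing why column ranges prune well only when the column is sorted or clustered across files.

    # Hypothetical per-file (min, max) stats for one column; nothing Iceberg-specific.
    def files_to_scan(file_stats, lo, hi):
        """Keep only files whose [min, max] range can contain values in [lo, hi]."""
        return [f for f, (fmin, fmax) in file_stats.items()
                if fmax >= lo and fmin <= hi]

    # Column sorted across files: ranges are disjoint, the predicate hits one file.
    sorted_stats = {"f1": (0, 99), "f2": (100, 199), "f3": (200, 299)}
    print(files_to_scan(sorted_stats, 150, 160))    # ['f2']

    # Same values written unsorted: every file's range overlaps the predicate,
    # so nothing is skipped and the lookup degenerates into a full scan.
    unsorted_stats = {"f1": (0, 295), "f2": (3, 299), "f3": (1, 290)}
    print(files_to_scan(unsorted_stats, 150, 160))  # ['f1', 'f2', 'f3']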
>>>>>>>>>>> Eager is conceptually just lazy plus compaction done, well, eagerly. The logic for both is exactly the same; the trade-off is just that with eager you implicitly compact every time, so that you don't do any work on read, while with lazy you want to amortize the cost of compaction over multiple snapshots.
>>>>>>>>>>>
>>>>>>>>>>> Basically, there should be no difference between the two conceptually, or with regard to keys, etc. The only difference is some mechanics in implementation.
>>>>>>>>>>
>>>>>>>>>> I think you have deconstructed the problem too much to say these are the same (or at least that is what I'm starting to think given this thread). It seems like real-world implementation decisions (per our discussion here) are in conflict. For example, you just argued against having 1mm arbitrary mutations, but I think that is because you aren't thinking about things over time with a delta implementation. Having 10,000 mutations a day, where we do delta compaction once a week and keep local file mappings (key-to-offset sparse bitmaps), seems like it could result in very good performance in a case where we're mutating small amounts of data. In this scenario, you may never do major compaction unless you get to a high enough percentage of records that have been deleted in the original dataset. That drives a very different set of implementation decisions from a situation where you're trying to restate an entire partition at once.
>>>>>>>>>>
>>>>>>>>>> We operate on 1 billion mutations per day at least. This is the problem Iceberg wants to solve; I believe it's stated upfront. 10,000/day is not a big data problem. It can be done fairly trivially and it would be supported, but there's not much point in specifically optimizing for that use case, I believe.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
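Finally, a rough sketch of the delta-compaction trade-off Jacques describes above. The names and the threshold are invented; it only illustrates keeping small per-file delete sets (e.g. position bitmaps) and rewriting the base data only once enough of it has been deleted.

    def merge_deltas(daily_deltas):
        """Minor compaction: fold many small delete sets into one set per data file."""
        merged = set()
        for delta in daily_deltas:
            merged |= delta
        return merged

    def maybe_major_compact(rows, deleted_positions, threshold=0.3):
        """Rewrite the data file only when the deleted fraction crosses a threshold."""
        if len(deleted_positions) / len(rows) < threshold:
            return rows, deleted_positions        # keep reading with merge-on-read
        live = [r for i, r in enumerate(rows) if i not in deleted_positions]
        return live, set()                        # rewritten file, no deltas left

    rows = list(range(100))                        # stand-in for a 100-row data file
    week_of_deltas = [{1, 2}, {10}, {42, 43}]      # a week of small daily delete sets
    deleted = merge_deltas(week_of_deltas)

    rows, deleted = maybe_major_compact(rows, deleted)
    print(len(rows), len(deleted))                 # 100 5 -> not worth rewriting yet

Whether that amortization is the case worth optimizing for is where the two sides above differ.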