Thanks, Erik! Great to see progress here. I'll set aside some time to look
this over in detail.

On Fri, Jun 7, 2019 at 11:46 AM Erik Wright <erik.wri...@shopify.com> wrote:

> I apologize for the delay, but I have finally put together a document
> describing an alternative approach to supporting updates in Iceberg while
> minimizing write amplification.
>
> Proposal: Iceberg Merge-on-Read
> <https://docs.google.com/document/d/1KuOMeS8Hw_yuE5IXtII8EClJlEMtqFECfWbg-gN5lsQ/edit?usp=sharing>
>
> Thank you, Anton and Miguel, for starting this conversation, and everyone
> else as well for the ongoing dialogue. I'm looking forward to continuing to
> discuss this and hopefully finding an approach that can meet our different
> needs and that we can work together on.
>
> Cheers,
>
> Erik
>
> On Wed, May 29, 2019 at 7:26 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> Yeah, I totally forgot to record our discussion. Will do so next time,
>> sorry.
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, May 29, 2019 at 4:24 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> It wasn't recorded, but I can summarize what we talked about. Sorry I
>>> haven't sent this out earlier.
>>>
>>> We talked about the options and some of the background in Iceberg --
>>> basically that it isn't possible to determine the order of commits before
>>> you commit, so you can't rely on some monotonically increasing value from
>>> a snapshot to know which deltas to apply to a file. As a result, we can't
>>> apply diffs to data files using a rule like "files older than X" because
>>> we can't identify those files without the snapshot history.
>>>
>>> That gives us basically two options for scoping delete diffs: either
>>> identify the files a diff applies to when writing the diff, or log
>>> changes applied to a snapshot and keep the snapshot history around (which
>>> is how we know the order of snapshots). The first option is not good if
>>> you want to write without reading data to determine where the deleted
>>> records are; the second prevents cleaning up snapshot history.
>>>
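>>> To make that concrete, here is a rough sketch of the metadata each option
>>> would need. This is illustrative Java only, not actual or proposed
>>> Iceberg classes; the names are made up:
>>>
>>>   import java.util.List;
>>>
>>>   // Option 1: the writer names the data files the diff applies to,
>>>   // which means knowing where the deleted rows live at write time.
>>>   record ScopedDeleteDiff(String deleteFile,
>>>                           List<String> targetDataFiles) {}
>>>
>>>   // Option 2: the diff only records the snapshot it was written
>>>   // against; readers need the snapshot history to work out which
>>>   // data files predate it.
>>>   record SnapshotScopedDeleteDiff(String deleteFile,
>>>                                   long snapshotId) {}
>>>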
>>> We also talked about whether we should encode IDs in data files. Jacques
>>> pointed out that retrying a commit is easier if you don't need to re-read
>>> the original data to reconcile changes. For example, if a data file was
>>> compacted in a concurrent write, how do we reconcile a delete for it? We
>>> discussed other options, like rolling back the compaction for delete
>>> events. I think that's a promising option.
>>>
>>> For action items, Jacques was going to think about whether we need to
>>> encode IDs in data files or if we could use positions to identify rows and
>>> write up a summary/proposal. Erik was going to take on planning how
>>> identifying rows without reading data would work and similarly write up a
>>> summary/proposal.
>>>
>>> That's from memory, so if I've missed anything, I hope that other
>>> attendees will fill in the details!
>>>
>>> rb
>>>
>>> On Wed, May 29, 2019 at 3:34 PM Venkatakrishnan Sowrirajan <
>>> vsowr...@asu.edu> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> I couldn't attend the meeting. Just curious if it was recorded by any
>>>> chance.
>>>>
>>>> Regards
>>>> Venkata krishnan
>>>>
>>>>
>>>> On Fri, May 24, 2019 at 8:49 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Yes, I agree. I'll talk a little about a couple of the constraints of
>>>>> this as well.
>>>>>
>>>>> On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi <
>>>>> aokolnyc...@apple.com> wrote:
>>>>>
>>>>>> The agenda looks good to me. I think it would also make sense to
>>>>>> clarify the responsibilities of query engines and Iceberg, not only in
>>>>>> terms of uniqueness but also in terms of applying diffs on read, for
>>>>>> example.
>>>>>>
>>>>>> On 23 May 2019, at 01:59, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>> wrote:
>>>>>>
>>>>>> Here’s a rough agenda:
>>>>>>
>>>>>>    - Use cases: everyone come with a use case that you’d like to
>>>>>>    have supported. We’ll go around and introduce ourselves and our
>>>>>>    use cases.
>>>>>>    - Main topic: How should Iceberg identify rows that are deleted?
>>>>>>    - Side topics from my initial email, if we have time: should we
>>>>>>    use insert diffs, should we support dense and sparse formats, etc.
>>>>>>
>>>>>> The main topic I think we should discuss is: *How should Iceberg
>>>>>> identify rows that are deleted?*
>>>>>>
>>>>>> I’m phrasing it this way to avoid the places where I think we’re
>>>>>> talking past one another because we’re making different assumptions.
>>>>>> The important thing is that there are two main options:
>>>>>>
>>>>>>    - Filename and position, vs
>>>>>>    - Specific values of (few) columns in the data
>>>>>>
>>>>>> This phrasing also avoids discussing uniqueness constraints. Once we
>>>>>> get down to behavior, I think we agree. For example, I think we all agree
>>>>>> that uniqueness cannot be enforced in Iceberg.
>>>>>>
>>>>>> If uniqueness can’t be enforced in Iceberg, the main choice comes
>>>>>> down to how we identify rows that are deleted. If we use (filename,
>>>>>> position) then we know that there is only one row. On the other hand,
>>>>>> if we use data values to identify rows then a delete may identify more
>>>>>> than one row because there are no uniqueness guarantees. I think we
>>>>>> also agree that if there is more than one row identified, all of them
>>>>>> should be deleted.
>>>>>>
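>>>>>> Purely to illustrate the two shapes (not proposed Iceberg types, just
>>>>>> a sketch in Java with made-up names):
>>>>>>
>>>>>>   import java.util.Map;
>>>>>>
>>>>>>   // Identify a deleted row by where it physically lives: this can
>>>>>>   // only ever match a single row.
>>>>>>   record PositionDelete(String dataFile, long position) {}
>>>>>>
>>>>>>   // Identify deleted rows by column values: with no uniqueness
>>>>>>   // guarantee this may match several rows, all of which should be
>>>>>>   // deleted.
>>>>>>   record ValueDelete(Map<String, Object> equalityValues) {}
>>>>>>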
>>>>>> At that point, there are trade-offs between the approaches:
>>>>>>
>>>>>>    - When identifying deleted rows by data values, situations like
>>>>>>    the one that Anton pointed out are possible.
>>>>>>    - Jacques also had a good point about concurrency. If at all
>>>>>>    possible, we want to be able to reconcile changes between concurrent
>>>>>>    commits without re-running an operation.
>>>>>>
>>>>>> Sound like a reasonable amount to talk through?
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Wed, May 22, 2019 at 1:17 PM Erik Wright <erik.wri...@shopify.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 22, 2019 at 4:04 PM Cristian Opris <
>>>>>>> cop...@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Agreed with Erik here, we're certainly not looking to build the
>>>>>>>> equivalent of a relational database, and for that matter not even
>>>>>>>> that of a local disk storage analytics database (like Vertica).
>>>>>>>> Those are very different designs with very different trade-offs and
>>>>>>>> optimizations.
>>>>>>>>
>>>>>>>> We're looking to automate and optimize specific types of file
>>>>>>>> manipulation for large files on remote storage, while presenting
>>>>>>>> that to the user under the common SQL API for *bulk* data
>>>>>>>> manipulation (MERGE INTO).
>>>>>>>>
>>>>>>>
>>>>>>> What I would encourage is to decouple the storage model from the
>>>>>>> implementation of that API. If Iceberg has support for merge-on-read
>>>>>>> of upserts and deletes, in addition to its powerful support for
>>>>>>> partitioning, it will be easy for a higher-level application to
>>>>>>> implement those APIs given certain other constraints (that might not
>>>>>>> be appropriate to all applications).
>>>>>>>
>>>>>>> Miguel and I are out on Friday, but Anton should be able to
>>>>>>>> handle the discussion on our side.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Cristian
>>>>>>>>
>>>>>>>>
>>>>>>>> On 22 May 2019, at 17:51, Erik Wright <
>>>>>>>> erik.wri...@shopify.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> We have two rows with the same natural key and we use that natural
>>>>>>>>> key in diff files:
>>>>>>>>> nk | col1 | col2
>>>>>>>>> 1 | 1 | 1
>>>>>>>>> 1 | 2 | 2
>>>>>>>>> Then we have a delete statement:
>>>>>>>>> DELETE FROM t WHERE col1 = 1
>>>>>>>>
>>>>>>>>
>>>>>>>> I think this example cuts to the heart of the differences in
>>>>>>>> understanding. Does Iceberg want to approach the utility of a
>>>>>>>> relational database, against which I can execute complex update
>>>>>>>> queries? This is not what I would have imagined.
>>>>>>>>
>>>>>>>> I would have, instead, imagined that it was up to the client to
>>>>>>>> identify, through whatever means, that they want to update or delete
>>>>>>>> a row with a given ID. If there are multiple (distinct) rows with
>>>>>>>> the same ID, _too bad_. Any user should _expect_ that they could
>>>>>>>> potentially see any one or more of those rows at read time. And that
>>>>>>>> an upsert/delete would affect any/all of them (I would argue for
>>>>>>>> all).
>>>>>>>>
>>>>>>>> *In summary:* Instead of trying to come up with a consistent,
>>>>>>>> logical handling for complex queries that are best suited for a
>>>>>>>> relational database, leave such handling up to the client and
>>>>>>>> concentrate on problems that can be solved simply and more
>>>>>>>> generally.
>>>>>>>>
>>>>>>>> On Wed, May 22, 2019 at 12:11 PM Ryan Blue <
>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Yes, I think we should. I was going to propose one after catching
>>>>>>>>> up on the rest of this thread today.
>>>>>>>>>
>>>>>>>>> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <
>>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks! Would it make sense to discuss the agenda in advance?
>>>>>>>>>>
>>>>>>>>>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I sent out an invite and included everyone on this thread. If
>>>>>>>>>> anyone else would like to join, please join the Zoom meeting. If 
>>>>>>>>>> you'd like
>>>>>>>>>> to be added to the calendar invite, just let me know and I'll add 
>>>>>>>>>> you.
>>>>>>>>>>
>>>>>>>>>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <
>>>>>>>>>> jacq...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> works for me.
>>>>>>>>>>>
>>>>>>>>>>> To make things easier, we can use my zoom meeting if people like:
>>>>>>>>>>>
>>>>>>>>>>> Join Zoom Meeting
>>>>>>>>>>> https://zoom.us/j/4157302092
>>>>>>>>>>>
>>>>>>>>>>> One tap mobile
>>>>>>>>>>> +16465588656,,4157302092# US (New York)
>>>>>>>>>>> +16699006833,,4157302092# US (San Jose)
>>>>>>>>>>>
>>>>>>>>>>> Dial by your location
>>>>>>>>>>>         +1 646 558 8656 US (New York)
>>>>>>>>>>>         +1 669 900 6833 US (San Jose)
>>>>>>>>>>>         877 853 5257 US Toll-free
>>>>>>>>>>>         888 475 4499 US Toll-free
>>>>>>>>>>> Meeting ID: 415 730 2092
>>>>>>>>>>> Find your local number: https://zoom.us/u/aH9XYBfm
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <
>>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> 9AM on Friday works best for me. How about then?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <
>>>>>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What about this Friday? One hour slot from 9:00 to 10:00 am or
>>>>>>>>>>>>> 10:00 to 11:00 am PST? Some folks are based in London, so meeting 
>>>>>>>>>>>>> later
>>>>>>>>>>>>> than this is hard. If Friday doesn’t work, we can consider 
>>>>>>>>>>>>> Tuesday or
>>>>>>>>>>>>> Wednesday next week.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Anton that we should probably spend some time on
>>>>>>>>>>>>> hangouts further discussing things. Definitely differing 
>>>>>>>>>>>>> expectations here
>>>>>>>>>>>>> and we seem to be talking a bit past each other.
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <
>>>>>>>>>>>>> cop...@apple.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I love a good flame war :P
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's my point, truly independent writers (two Spark jobs,
>>>>>>>>>>>>>>> or a Spark job and Dremio job) means a distributed transaction. 
>>>>>>>>>>>>>>> It would
>>>>>>>>>>>>>>> need yet another external transaction coordinator on top of 
>>>>>>>>>>>>>>> both Spark and
>>>>>>>>>>>>>>> Dremio, Iceberg by itself
>>>>>>>>>>>>>>> cannot solve this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not ready to accept this. Iceberg already supports a set
>>>>>>>>>>>>>> of semantics around multiple writers committing simultaneously 
>>>>>>>>>>>>>> and how
>>>>>>>>>>>>>> conflict resolution is done. The same can be done here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MVCC (which is what Iceberg tries to implement) requires a
>>>>>>>>>>>>>> total ordering of snapshots. Also the snapshots need to be 
>>>>>>>>>>>>>> non-conflicting.
>>>>>>>>>>>>>> I really don't see how any metadata data structures can solve 
>>>>>>>>>>>>>> this without
>>>>>>>>>>>>>> an outside coordinator.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Consider this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Snapshot 0: (K,A) = 1
>>>>>>>>>>>>>> Job X: UPDATE K SET A=A+1
>>>>>>>>>>>>>> Job Y: UPDATE K SET A=10
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What should the final value of A be and who decides?
>>>>>>>>>>>>>>
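>>>>>>>>>>>>>> (For concreteness: if X commits first and Y applies on top, A
>>>>>>>>>>>>>> ends up 10; if Y commits first and X applies on top, A ends up
>>>>>>>>>>>>>> 11; without an agreed ordering there is no single right
>>>>>>>>>>>>>> answer.)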
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> By single writer, I don't mean single process, I mean
>>>>>>>>>>>>>>> multiple coordinated processes like Spark executors coordinated 
>>>>>>>>>>>>>>> by Spark
>>>>>>>>>>>>>>> driver. The coordinator ensures that the data is 
>>>>>>>>>>>>>>> pre-partitioned on
>>>>>>>>>>>>>>> each executor, and the coordinator commits the snapshot.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note however that single writer job/multiple concurrent
>>>>>>>>>>>>>>> reader jobs is perfectly feasible, i.e. it shouldn't be a 
>>>>>>>>>>>>>>> problem to write
>>>>>>>>>>>>>>> from a Spark job and read from multiple Dremio queries 
>>>>>>>>>>>>>>> concurrently (for
>>>>>>>>>>>>>>> example)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> :D This is still "single process" from my perspective. That
>>>>>>>>>>>>>> process may be coordinating other processes to do distributed 
>>>>>>>>>>>>>> work but
>>>>>>>>>>>>>> ultimately it is a single process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fair enough
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure what you mean exactly. If we can't enforce
>>>>>>>>>>>>>>> uniqueness we shouldn't assume it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I disagree. We can specify that as a requirement and state
>>>>>>>>>>>>>> that you'll get unintended consequences if you provide your own 
>>>>>>>>>>>>>> keys and
>>>>>>>>>>>>>> don't maintain this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There's no need for unintended consequences, we can specify
>>>>>>>>>>>>>> consistent behaviour (and I believe the document says what that 
>>>>>>>>>>>>>> is)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We do expect that most of the time the natural key is
>>>>>>>>>>>>>>> unique, but the eager and lazy natural-key designs can handle
>>>>>>>>>>>>>>> duplicates consistently. Basically it's not a problem to have
>>>>>>>>>>>>>>> duplicate natural keys; everything works fine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That heavily depends on how things are implemented. For
>>>>>>>>>>>>>> example, we may write a bunch of code that generates internal
>>>>>>>>>>>>>> data structures based on this expectation. If we have to
>>>>>>>>>>>>>> support duplicate matches, all of a sudden we can no longer
>>>>>>>>>>>>>> size various data structures to improve performance, and may
>>>>>>>>>>>>>> be unable to preallocate memory associated with a guaranteed
>>>>>>>>>>>>>> completion.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Again, we need to operate on the assumption that this is a
>>>>>>>>>>>>>> large-scale distributed compute/remote storage scenario. Key
>>>>>>>>>>>>>> matching is done with shuffles that move data across the
>>>>>>>>>>>>>> network, so such optimizations would really have little impact
>>>>>>>>>>>>>> on overall performance. Not to mention that most query engines
>>>>>>>>>>>>>> would already optimize the shuffle as much as it can be
>>>>>>>>>>>>>> optimized.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is true that actual duplicate keys would make the
>>>>>>>>>>>>>> key-matching join (anti-join) somewhat more expensive;
>>>>>>>>>>>>>> however, it can be done in such a way that, if the keys are in
>>>>>>>>>>>>>> practice unique, the join is as efficient as it can be.
>>>>>>>>>>>>>>
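>>>>>>>>>>>>>> As a toy, single-node illustration of the key-matching
>>>>>>>>>>>>>> anti-join (the real thing is a shuffle in the engine; the
>>>>>>>>>>>>>> class and data here are made up):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   import java.util.List;
>>>>>>>>>>>>>>   import java.util.Set;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   class AntiJoinSketch {
>>>>>>>>>>>>>>     public static void main(String[] args) {
>>>>>>>>>>>>>>       // toy rows as {key, value}; note the duplicate key k1
>>>>>>>>>>>>>>       List<String[]> rows = List.of(
>>>>>>>>>>>>>>           new String[] {"k1", "a"},
>>>>>>>>>>>>>>           new String[] {"k2", "b"},
>>>>>>>>>>>>>>           new String[] {"k1", "c"});
>>>>>>>>>>>>>>       // keys collected from the delete diffs
>>>>>>>>>>>>>>       Set<String> deleted = Set.of("k1");
>>>>>>>>>>>>>>       // anti-join: keep rows whose key has no match in the
>>>>>>>>>>>>>>       // deletes; duplicate keys just mean more rows drop out
>>>>>>>>>>>>>>       rows.stream()
>>>>>>>>>>>>>>           .filter(r -> !deleted.contains(r[0]))
>>>>>>>>>>>>>>           .forEach(r ->
>>>>>>>>>>>>>>               System.out.println(r[0] + " -> " + r[1]));
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>   }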
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me try and clarify each point:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - lookup for query or update on a non-(partition/bucket/sort)
>>>>>>>>>>>>>>> key predicate implies scanning large amounts of data - because
>>>>>>>>>>>>>>> these are the only data structures that can narrow down the
>>>>>>>>>>>>>>> lookup, right? One could argue that the min/max index (file
>>>>>>>>>>>>>>> skipping) can be applied to any column, but in reality if that
>>>>>>>>>>>>>>> column is not sorted the min/max intervals can have huge
>>>>>>>>>>>>>>> overlaps, so it may be next to useless.
>>>>>>>>>>>>>>> - remote storage - this is a critical architecture decision -
>>>>>>>>>>>>>>> implementations on local storage imply a vastly different
>>>>>>>>>>>>>>> design for the entire system, storage and compute.
>>>>>>>>>>>>>>> - deleting single records per snapshot is unfeasible in the
>>>>>>>>>>>>>>> eager design, but particularly so in the lazy design: each
>>>>>>>>>>>>>>> deletion creates a very small snapshot. Deleting 1 million
>>>>>>>>>>>>>>> records one at a time would create 1 million small files, and
>>>>>>>>>>>>>>> 1 million RPC calls.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why is this unfeasible? If I have a dataset of 100mm files
>>>>>>>>>>>>>> including 1mm small files, is that a major problem? It seems
>>>>>>>>>>>>>> like your use case isn't one where you want to support
>>>>>>>>>>>>>> single-record deletes, but it is definitely something
>>>>>>>>>>>>>> important to many people.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 100 mm total files or 1 mm files per dataset is definitely a
>>>>>>>>>>>>>> problem on HDFS, and I believe on S3 too. Single key delete 
>>>>>>>>>>>>>> would work just
>>>>>>>>>>>>>> fine, but it's simply not optimal to do that on remote storage. 
>>>>>>>>>>>>>> This is a
>>>>>>>>>>>>>> very well known problem with HDFS, and one of the very reasons 
>>>>>>>>>>>>>> to have
>>>>>>>>>>>>>> something like Iceberg in the first place.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically the users would be able to do single-key mutation,
>>>>>>>>>>>>>> but it's not the use case we should be optimizing for, and
>>>>>>>>>>>>>> it's really not advisable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Eager is conceptually just lazy + compaction done, well,
>>>>>>>>>>>>>>> eagerly. The logic for both is exactly the same; the
>>>>>>>>>>>>>>> trade-off is just that with eager you implicitly compact
>>>>>>>>>>>>>>> every time so that you don't do any work on read, while with
>>>>>>>>>>>>>>> lazy you want to amortize the cost of compaction over
>>>>>>>>>>>>>>> multiple snapshots.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically there should be no difference between the two
>>>>>>>>>>>>>>> conceptually, or with regard to keys, etc. The only difference 
>>>>>>>>>>>>>>> is some
>>>>>>>>>>>>>>> mechanics in implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think you have deconstructed the problem too much to say
>>>>>>>>>>>>>> these are the same (or at least that is what I'm starting to
>>>>>>>>>>>>>> think given this thread). It seems like real world
>>>>>>>>>>>>>> implementation decisions (per our discussion here) are in
>>>>>>>>>>>>>> conflict. For example, you just argued against having 1mm
>>>>>>>>>>>>>> arbitrary mutations, but I think that is because you aren't
>>>>>>>>>>>>>> thinking about things over time with a delta implementation.
>>>>>>>>>>>>>> Having 10,000 mutations a day where we do delta compaction
>>>>>>>>>>>>>> once a week and local file mappings (key to offset sparse
>>>>>>>>>>>>>> bitmaps) seems
>>>>>>>>>>>>>> like it could result in very good performance in a case where 
>>>>>>>>>>>>>> we're
>>>>>>>>>>>>>> mutating small amounts of data. In this scenario, you may not do 
>>>>>>>>>>>>>> major
>>>>>>>>>>>>>> compaction ever unless you get to a high enough percentage of 
>>>>>>>>>>>>>> records that
>>>>>>>>>>>>>> have been deleted in the original dataset. That drives a very 
>>>>>>>>>>>>>> different set
>>>>>>>>>>>>>> of implementation decisions from a situation where you're trying 
>>>>>>>>>>>>>> to restate
>>>>>>>>>>>>>> an entire partition at once.
>>>>>>>>>>>>>>
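>>>>>>>>>>>>>> A toy sketch of the kind of per-file mapping I mean
>>>>>>>>>>>>>> (illustrative only; a real version would likely use a
>>>>>>>>>>>>>> compressed bitmap rather than java.util.BitSet, and the names
>>>>>>>>>>>>>> are made up):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   import java.util.BitSet;
>>>>>>>>>>>>>>   import java.util.HashMap;
>>>>>>>>>>>>>>   import java.util.Map;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   class FileDeleteIndex {
>>>>>>>>>>>>>>     // natural key -> row offset within one data file
>>>>>>>>>>>>>>     private final Map<String, Integer> offsets =
>>>>>>>>>>>>>>         new HashMap<>();
>>>>>>>>>>>>>>     // offsets tombstoned by the deltas applied so far
>>>>>>>>>>>>>>     private final BitSet deleted = new BitSet();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     void index(String key, int offset) {
>>>>>>>>>>>>>>       offsets.put(key, offset);
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     // apply one deleted key; no-op if not in this file
>>>>>>>>>>>>>>     void delete(String key) {
>>>>>>>>>>>>>>       Integer off = offsets.get(key);
>>>>>>>>>>>>>>       if (off != null) { deleted.set(off); }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     boolean isLive(int offset) {
>>>>>>>>>>>>>>       return !deleted.get(offset);
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>   }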
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We operate on 1 billion mutations per day at least. This is
>>>>>>>>>>>>>> the problem Iceberg wants to solve; I believe it's stated up
>>>>>>>>>>>>>> front. 10,000/day is not a big data problem. It can be done
>>>>>>>>>>>>>> fairly trivially and it would be supported, but I believe
>>>>>>>>>>>>>> there's not much point in extra optimization for this use
>>>>>>>>>>>>>> case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix
