Erik, thanks for working on this doc. It’s a good detailed write-up of the
approach using natural keys and I found the section about efficiently
merging changes together to be helpful.

For the specific changes to Iceberg metadata, I think we will need to
adjust the approach. Dan and I were thinking about the sequencing problem
last week and came up with a solution that augments what’s in the proposal.
Conceptually, what is in the doc would definitely work, but it hits a few
pitfalls that we’ve already had to work around:

   - In early versions, Iceberg kept each version’s list of manifests (the
   equivalent of today’s manifest list) in the table metadata (JSON) file.
   That ended up taking too much space and caused high write latency. It
   also required opening too many files at planning time.
   - Maintaining separate manifest lists for each version requires keeping
   versions in separate metadata trees, so we can’t merge manifest files
   together across versions. We initially had this problem when we used only
   fast appends for our streaming writes. We ended up with way too many
   metadata files and job planning took a really long time.
   - Although snapshots are independent and can be aged off, expiring
   versions requires rewriting the base version. That means we can’t delete
   old data files until versions are rewritten; the problem just moves from
   snapshots to versions. We would still not be able to delete old data
   until after a compaction.

The solution we came up with is to assign each data or delete file a
sequence number. This is basically the same thing as the version numbers in
your doc, so what you described for reading would work once every file is
associated with a version number. The challenge is assigning sequence
numbers without needing to rewrite metadata when commits conflict.
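
To make the ordering rule concrete, here is a minimal sketch (hypothetical
names, not actual Iceberg code) of how a reader could decide whether a
delete applies to a data file once both carry sequence numbers, assuming a
delete only affects files committed before it:

    // Sketch only: every committed data or delete file carries the sequence
    // number of the commit that added it.
    class TrackedFile {
      final String path;
      final long sequenceNumber;

      TrackedFile(String path, long sequenceNumber) {
        this.path = path;
        this.sequenceNumber = sequenceNumber;
      }
    }

    class DeleteApplication {
      // A delete applies only to rows in data files committed earlier,
      // i.e. files with a strictly smaller sequence number.
      static boolean applies(TrackedFile delete, TrackedFile dataFile) {
        return dataFile.sequenceNumber < delete.sequenceNumber;
      }
    }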

The proposal keeps sequence numbers in the table metadata file as versions
and avoids rewrites by adding more manifest lists. Building on that idea,
we suggest storing the sequence number for each manifest in the manifest
list instead. That way, we only have one manifest list file per snapshot,
just like the current format. This works because the manifest list is
always rewritten when there is a commit conflict. The proposal avoids
rewriting manifest lists by embedding this information in the table
metadata file, but then each conflict causes you to rewrite the version
list for all snapshots: instead of rewriting 2 files, it rewrites one much
larger file. As I mentioned above, that cost is part of why we introduced
manifest lists in the first place.
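
As a rough illustration (the field names here are hypothetical, not a
committed format), a manifest list entry would carry a sequence number
alongside the per-manifest fields it already tracks:

    // Sketch of a manifest list entry; only the sequence number is new
    // relative to what the manifest list tracks per manifest today.
    class ManifestListEntry {
      final String manifestPath;
      final long manifestLength;
      final long addedSnapshotId;
      final long sequenceNumber; // assigned when the commit succeeds

      ManifestListEntry(String manifestPath, long manifestLength,
                        long addedSnapshotId, long sequenceNumber) {
        this.manifestPath = manifestPath;
        this.manifestLength = manifestLength;
        this.addedSnapshotId = addedSnapshotId;
        this.sequenceNumber = sequenceNumber;
      }
    }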

Because the manifest list is rewritten when commits conflict, the sequence
number of manifests added in that snapshot can be updated. That solves one
problem, but the manifests themselves still would not contain sequence
numbers, so we couldn’t merge manifests. To solve that,
we propose optionally writing a sequence number with each data file in a
manifest. If a data or delete file doesn’t have a sequence number, it
inherits the sequence number of the manifest on read. Then compacting
manifests just works because data files are guaranteed to have a correct
sequence number when they are read. Also, manifests don’t need to be
rewritten during a conflict because new manifests are written without a
sequence number for each data file: the data files inherit the sequence
number that is set for the manifest when it is committed.
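
Here is a minimal sketch of the inheritance rule on the read path (again
with hypothetical names, not the real reader code): any entry written
without an explicit sequence number takes the sequence number assigned to
its manifest at commit time, while compacted manifests keep explicit
per-file numbers:

    import java.util.ArrayList;
    import java.util.List;

    class DataFileEntry {
      final String path;
      final Long sequenceNumber; // null when the writer did not set one

      DataFileEntry(String path, Long sequenceNumber) {
        this.path = path;
        this.sequenceNumber = sequenceNumber;
      }
    }

    class SequenceInheritance {
      // Resolve the effective sequence number for each entry in a manifest:
      // explicit numbers win, everything else inherits from the manifest.
      static List<DataFileEntry> resolve(List<DataFileEntry> entries,
                                         long manifestSequenceNumber) {
        List<DataFileEntry> resolved = new ArrayList<>();
        for (DataFileEntry entry : entries) {
          long seq = entry.sequenceNumber != null
              ? entry.sequenceNumber
              : manifestSequenceNumber;
          resolved.add(new DataFileEntry(entry.path, seq));
        }
        return resolved;
      }
    }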

Using this strategy, we can store all of the version information from the
proposal as sequence numbers in Iceberg’s existing metadata structures,
with only minimal changes to those structures. Here are some benefits to
this approach:

   - There is no explicit version that must be expired so compaction is
   only required to optimize reads — partition or file-level delete can
   coexist with deltas
   - When a delete’s sequence number is lower than that of all data files,
   it can be pruned (see the sketch after this list)
   - Compaction can translate a natural key delete to a synthetic key
   delete by adding a synthetic key delete file and updating the data file’s
   sequence number
   - Synthetic and natural key approaches can coexist
   - More of the existing Iceberg operations and structure are reused —
   less code is required to get this working
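
On the pruning point above, here is a minimal sketch of the check, under
the same assumption as before that a delete only affects data files with a
strictly smaller sequence number. A delete that is not newer than any data
file in its scope cannot match any rows and can be dropped during metadata
maintenance:

    class DeletePruning {
      // Assuming a delete applies only to data files with a strictly smaller
      // sequence number, a delete whose sequence number is not greater than
      // the smallest data file sequence number in scope matches nothing.
      static boolean canPrune(long deleteSequenceNumber,
                              long minDataFileSequenceNumber) {
        return deleteSequenceNumber <= minDataFileSequenceNumber;
      }
    }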

I think by combining your ideas with the sequence number approach, we’re
fairly close to something we can start building and I’m confident that it
would perform well.

I’m happy to follow up with more detail if what I’m proposing isn’t clear.
We can also set up another sync-up to talk this through if
everyone is interested.

rb

On Fri, Jun 7, 2019 at 11:47 AM Ryan Blue <rb...@netflix.com> wrote:

> Thanks, Erik! Great to see progress here. I'll set aside some time to look
> this over in detail.
>
> On Fri, Jun 7, 2019 at 11:46 AM Erik Wright <erik.wri...@shopify.com>
> wrote:
>
>> I apologize for the delay, but I have finally put together a document
>> describing an alternative approach to supporting updates in Iceberg while
>> minimizing write amplification.
>>
>> Proposal: Iceberg Merge-on-Read
>> <https://docs.google.com/document/d/1KuOMeS8Hw_yuE5IXtII8EClJlEMtqFECfWbg-gN5lsQ/edit?usp=sharing>
>>
>> Thank you, Anton and Miguel, for starting this conversation, and everyone
>> else as well for the ongoing dialogue. I'm looking forward to continuing to
>> discuss this and hopefully finding an approach that can meet our different
>> needs and that we can work together on.
>>
>> Cheers,
>>
>> Erik
>>
>> On Wed, May 29, 2019 at 7:26 PM Jacques Nadeau <jacq...@dremio.com>
>> wrote:
>>
>>> Yeah, I totally forgot to record our discussion. Will do so next time,
>>> sorry.
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, May 29, 2019 at 4:24 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> It wasn't recorded, but I can summarize what we talked about. Sorry I
>>>> haven't sent this out earlier.
>>>>
>>>> We talked about the options and some of the background in Iceberg --
>>>> basically that it isn't possible to determine the order of commits before
>>>> you commit so you can't rely on some monotonically increasing value from a
>>>> snapshot to know which deltas to apply to a file. The result is that we
>>>> can't apply diffs to data files using a rule like "files older than X"
>>>> because we can't identify those files without the snapshot history.
>>>>
>>>> That gives us basically 2 options for scoping delete diffs: either
>>>> identify the files to apply a diff to when writing the diff, or log changes
>>>> applied to a snapshot and keep the snapshot history around (which is how we
>>>> know the order of snapshots). The first option is not good if you want to
>>>> write without reading data to determine where the deleted records are. The
>>>> second prevents cleaning up snapshot history.
>>>>
>>>> We also talked about whether we should encode IDs in data files.
>>>> Jacques pointed out that retrying a commit is easier if you don't need to
>>>> re-read the original data to reconcile changes. For example, if a data file
>>>> was compacted in a concurrent write, how do we reconcile a delete for it?
>>>> We discussed other options, like rolling back the compaction for delete
>>>> events. I think that's a promising option.
>>>>
>>>> For action items, Jacques was going to think about whether we need to
>>>> encode IDs in data files or if we could use positions to identify rows and
>>>> write up a summary/proposal. Erik was going to take on planning how
>>>> identifying rows without reading data would work and similarly write up a
>>>> summary/proposal.
>>>>
>>>> That's from memory, so if I've missed anything, I hope that other
>>>> attendees will fill in the details!
>>>>
>>>> rb
>>>>
>>>> On Wed, May 29, 2019 at 3:34 PM Venkatakrishnan Sowrirajan <
>>>> vsowr...@asu.edu> wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> I couldn't attend the meeting. Just curious, if this is recorded by
>>>>> any chance.
>>>>>
>>>>> Regards
>>>>> Venkata krishnan
>>>>>
>>>>>
>>>>> On Fri, May 24, 2019 at 8:49 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Yes, I agree. I'll talk a little about a couple of the constraints of
>>>>>> this as well.
>>>>>>
>>>>>> On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi <
>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>
>>>>>>> The agenda looks good to me. I think it would also make sense to
>>>>>>> clarify the responsibilities of query engines and Iceberg. Not only in
>>>>>>> terms of uniqueness, but also in terms of applying diffs on read, for
>>>>>>> example.
>>>>>>>
>>>>>>> On 23 May 2019, at 01:59, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Here’s a rough agenda:
>>>>>>>
>>>>>>>    - Use cases: everyone come with a use case that you’d like to
>>>>>>>    have supported. We’ll go around and introduce ourselves and our use 
>>>>>>> cases.
>>>>>>>    - Main topic: How should Iceberg identify rows that are deleted?
>>>>>>>    - Side topics from my initial email, if we have time: should we
>>>>>>>    use insert diffs, should we support dense and sparse formats, etc.
>>>>>>>
>>>>>>> The main topic I think we should discuss is: *How should Iceberg
>>>>>>> identify rows that are deleted?*
>>>>>>>
>>>>>>> I’m phrasing it this way to avoid where I think we’re talking past
>>>>>>> one another because we are making assumptions. The important thing is 
>>>>>>> that
>>>>>>> there are two main options:
>>>>>>>
>>>>>>>    - Filename and position, vs
>>>>>>>    - Specific values of (few) columns in the data
>>>>>>>
>>>>>>> This phrasing also avoids discussing uniqueness constraints. Once we
>>>>>>> get down to behavior, I think we agree. For example, I think we all 
>>>>>>> agree
>>>>>>> that uniqueness cannot be enforced in Iceberg.
>>>>>>>
>>>>>>> If uniqueness can’t be enforced in Iceberg, the main choice comes
>>>>>>> down to how we identify rows that are deleted. If we use (filename,
>>>>>>> position) then we know that there is only one row. On the other hand, 
>>>>>>> if we
>>>>>>> use data values to identify rows then a delete may identify more than 
>>>>>>> one
>>>>>>> row because there are no uniqueness guarantees. I think we also agree 
>>>>>>> that
>>>>>>> if there is more than one row identified, all of them should be deleted.
>>>>>>>
>>>>>>> At that point, there are trade-offs between the approaches:
>>>>>>>
>>>>>>>    - When identifying deleted rows by data values, situations like
>>>>>>>    the one that Anton pointed out are possible.
>>>>>>>    - Jacques also had a good point about concurrency. If at all
>>>>>>>    possible, we want to be able to reconcile changes between concurrent
>>>>>>>    commits without re-running an operation.
>>>>>>>
>>>>>>> Sound like a reasonable amount to talk through?
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Wed, May 22, 2019 at 1:17 PM Erik Wright <erik.wri...@shopify.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 22, 2019 at 4:04 PM Cristian Opris <
>>>>>>>> cop...@apple.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Agreed with Erik here, we're certainly not looking to build the
>>>>>>>>> equivalent of a relational database, and for that matter not even 
>>>>>>>>> that of a
>>>>>>>>> local disk storage analytics database (like Vertica). Those are very
>>>>>>>>> different designs with very different trade-offs and optimizations.
>>>>>>>>>
>>>>>>>>> We're looking to automate and optimize specific types of file
>>>>>>>>> manipulation for large files on remote storage, while presenting that 
>>>>>>>>> to
>>>>>>>>> the user under the common SQL API for *bulk* data manipulation
>>>>>>>>> (MERGE INTO)
>>>>>>>>>
>>>>>>>>
>>>>>>>> What I would encourage is to decouple the storage model from the
>>>>>>>> implementation of that API. If Iceberg has support for merge-on-read of
>>>>>>>> upserts and deletes, in addition to its powerful support for 
>>>>>>>> partitioning,
>>>>>>>> it will be easy for a higher-level application to implement those APIs
>>>>>>>> given certain other constraints (that might not be appropriate to all
>>>>>>>> applications).
>>>>>>>>
>>>>>>>> Myself and Miguel are out on Friday, but Anton should be able to
>>>>>>>>> handle the discussion on our side.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cristian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22 May 2019, at 17:51, Erik Wright <
>>>>>>>>> erik.wri...@shopify.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> We have two rows with the same natural key and we use that natural
>>>>>>>>>> key in diff files:
>>>>>>>>>> nk | col1 | col2
>>>>>>>>>> 1 | 1 | 1
>>>>>>>>>> 1 | 2 | 2
>>>>>>>>>> Then we have a delete statement:
>>>>>>>>>> DELETE FROM t WHERE col1 = 1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this example cuts to the point of the differences of
>>>>>>>>> understanding. Does Iceberg want to be approaching the utility of a
>>>>>>>>> relational database, against which I can execute complex update 
>>>>>>>>> queries?
>>>>>>>>> This is not what I would have imagined.
>>>>>>>>>
>>>>>>>>> I would have, instead, imagined that it was up to the client to
>>>>>>>>> identify, through whatever means, that they want to update or delete 
>>>>>>>>> a row
>>>>>>>>> with a given ID. If there are multiple (distinct) rows with the same 
>>>>>>>>> ID,
>>>>>>>>> _too bad_. Any user should _expect_ that they could potentially see 
>>>>>>>>> any one
>>>>>>>>> or more of those rows at read time. And that an upsert/delete would 
>>>>>>>>> affect
>>>>>>>>> any/all of them (I would argue for all).
>>>>>>>>>
>>>>>>>>> *In summary:* Instead of trying to come up with a consistent,
>>>>>>>>> logical handling for complex queries that are best suited for a 
>>>>>>>>> relational
>>>>>>>>> database, leave such handling up to the client and concentrate on 
>>>>>>>>> problems
>>>>>>>>> that can be solved simply and more generally.
>>>>>>>>>
>>>>>>>>> On Wed, May 22, 2019 at 12:11 PM Ryan Blue <
>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I think we should. I was going to propose one after catching
>>>>>>>>>> up on the rest of this thread today.
>>>>>>>>>>
>>>>>>>>>> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <
>>>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks! Would it make sense to discuss the agenda in advance?
>>>>>>>>>>>
>>>>>>>>>>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I sent out an invite and included everyone on this thread. If
>>>>>>>>>>> anyone else would like to join, please join the Zoom meeting. If 
>>>>>>>>>>> you'd like
>>>>>>>>>>> to be added to the calendar invite, just let me know and I'll add 
>>>>>>>>>>> you.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <
>>>>>>>>>>> jacq...@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> works for me.
>>>>>>>>>>>>
>>>>>>>>>>>> To make things easier, we can use my zoom meeting if people
>>>>>>>>>>>> like:
>>>>>>>>>>>>
>>>>>>>>>>>> Join Zoom Meeting
>>>>>>>>>>>> https://zoom.us/j/4157302092
>>>>>>>>>>>>
>>>>>>>>>>>> One tap mobile
>>>>>>>>>>>> +16465588656,,4157302092# US (New York)
>>>>>>>>>>>> +16699006833,,4157302092# US (San Jose)
>>>>>>>>>>>>
>>>>>>>>>>>> Dial by your location
>>>>>>>>>>>>         +1 646 558 8656 US (New York)
>>>>>>>>>>>>         +1 669 900 6833 US (San Jose)
>>>>>>>>>>>>         877 853 5257 US Toll-free
>>>>>>>>>>>>         888 475 4499 US Toll-free
>>>>>>>>>>>> Meeting ID: 415 730 2092
>>>>>>>>>>>> Find your local number: https://zoom.us/u/aH9XYBfm
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <
>>>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> 9AM on Friday works best for me. How about then?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <
>>>>>>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> What about this Friday? One hour slot from 9:00 to 10:00 am
>>>>>>>>>>>>>> or 10:00 to 11:00 am PST? Some folks are based in London, so 
>>>>>>>>>>>>>> meeting later
>>>>>>>>>>>>>> than this is hard. If Friday doesn’t work, we can consider 
>>>>>>>>>>>>>> Tuesday or
>>>>>>>>>>>>>> Wednesday next week.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with Anton that we should probably spend some time on
>>>>>>>>>>>>>> hangouts further discussing things. Definitely differing 
>>>>>>>>>>>>>> expectations here
>>>>>>>>>>>>>> and we seem to be talking a bit past each other.
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <
>>>>>>>>>>>>>> cop...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I love a good flame war :P
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That's my point, truly independent writers (two Spark jobs,
>>>>>>>>>>>>>>>> or a Spark job and Dremio job) means a distributed 
>>>>>>>>>>>>>>>> transaction. It would
>>>>>>>>>>>>>>>> need yet another external transaction coordinator on top of 
>>>>>>>>>>>>>>>> both Spark and
>>>>>>>>>>>>>>>> Dremio, Iceberg by itself
>>>>>>>>>>>>>>>> cannot solve this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not ready to accept this. Iceberg already supports a set
>>>>>>>>>>>>>>> of semantics around multiple writers committing simultaneously 
>>>>>>>>>>>>>>> and how
>>>>>>>>>>>>>>> conflict resolution is done. The same can be done here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> MVCC (which is what Iceberg tries to implement) requires a
>>>>>>>>>>>>>>> total ordering of snapshots. Also the snapshots need to be 
>>>>>>>>>>>>>>> non-conflicting.
>>>>>>>>>>>>>>> I really don't see how any metadata data structures can solve 
>>>>>>>>>>>>>>> this without
>>>>>>>>>>>>>>> an outside coordinator.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Consider this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Snapshot 0: (K,A) = 1
>>>>>>>>>>>>>>> Job X: UPDATE K SET A=A+1
>>>>>>>>>>>>>>> Job Y: UPDATE K SET A=10
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What should the final value of A be and who decides ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> By single writer, I don't mean single process, I mean
>>>>>>>>>>>>>>>> multiple coordinated processes like Spark executors 
>>>>>>>>>>>>>>>> coordinated by Spark
>>>>>>>>>>>>>>>> driver. The coordinator ensures that the data is 
>>>>>>>>>>>>>>>> pre-partitioned on
>>>>>>>>>>>>>>>> each executor, and the coordinator commits the snapshot.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note however that single writer job/multiple concurrent
>>>>>>>>>>>>>>>> reader jobs is perfectly feasible, i.e. it shouldn't be a 
>>>>>>>>>>>>>>>> problem to write
>>>>>>>>>>>>>>>> from a Spark job and read from multiple Dremio queries 
>>>>>>>>>>>>>>>> concurrently (for
>>>>>>>>>>>>>>>> example)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> :D This is still "single process" from my perspective. That
>>>>>>>>>>>>>>> process may be coordinating other processes to do distributed 
>>>>>>>>>>>>>>> work but
>>>>>>>>>>>>>>> ultimately it is a single process.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Fair enough
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure what you mean exactly. If we can't enforce
>>>>>>>>>>>>>>>> uniqueness we shouldn't assume it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I disagree. We can specify that as a requirement and state
>>>>>>>>>>>>>>> that you'll get unintended consequences if you provide your own 
>>>>>>>>>>>>>>> keys and
>>>>>>>>>>>>>>> don't maintain this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There's no need for unintended consequences, we can specify
>>>>>>>>>>>>>>> consistent behaviour (and I believe the document says what that 
>>>>>>>>>>>>>>> is)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We do expect that most of the time the natural key is
>>>>>>>>>>>>>>>> unique, but the eager and lazy with natural key designs can 
>>>>>>>>>>>>>>>> handle
>>>>>>>>>>>>>>>> duplicates
>>>>>>>>>>>>>>>> consistently. Basically it's not a problem to have
>>>>>>>>>>>>>>>> duplicate natural keys, everything works fine.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That heavily depends on how things are implemented. For
>>>>>>>>>>>>>>> example, we may write a bunch of code that generates internal 
>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>> structures based on this expectation. If we have to support 
>>>>>>>>>>>>>>> duplicate
>>>>>>>>>>>>>>> matches, all of sudden we can no longer size various data 
>>>>>>>>>>>>>>> structures to
>>>>>>>>>>>>>>> improve performance and may be unable to preallocate memory 
>>>>>>>>>>>>>>> associated with
>>>>>>>>>>>>>>> a guaranteed completion.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Again we need to operate on the assumption that this is a
>>>>>>>>>>>>>>> large scale distributed compute/remote storage scenario. Key 
>>>>>>>>>>>>>>> matching is
>>>>>>>>>>>>>>> done with shuffles with data movement across the network, such
>>>>>>>>>>>>>>> optimizations would really have little impact on overall 
>>>>>>>>>>>>>>> performance. Not
>>>>>>>>>>>>>>> to mention that most query engines would already optimize the 
>>>>>>>>>>>>>>> shuffle
>>>>>>>>>>>>>>> already as much as it can be optimized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is true that if actual duplicate keys would make the key
>>>>>>>>>>>>>>> matching join (anti-join) somewhat more expensive, however it 
>>>>>>>>>>>>>>> can be done
>>>>>>>>>>>>>>> in such a way that if the keys are in practice unique the join 
>>>>>>>>>>>>>>> is as
>>>>>>>>>>>>>>> efficient as it can be.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let me try and clarify each point:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - lookup for query or update on a
>>>>>>>>>>>>>>>> non-(partition/bucket/sort) key predicate implies scanning 
>>>>>>>>>>>>>>>> large amounts of
>>>>>>>>>>>>>>>> data - because these are the only data structures that can 
>>>>>>>>>>>>>>>> narrow down the
>>>>>>>>>>>>>>>> lookup, right ? One could argue that the min/max index (file 
>>>>>>>>>>>>>>>> skipping) can
>>>>>>>>>>>>>>>> be applied to any column, but in reality if that column is not 
>>>>>>>>>>>>>>>> sorted the
>>>>>>>>>>>>>>>> min/max intervals can have huge overlaps so it may be next to 
>>>>>>>>>>>>>>>> useless.
>>>>>>>>>>>>>>>> - remote storage - this is a critical architecture decision
>>>>>>>>>>>>>>>> - implementations on local storage imply a vastly different 
>>>>>>>>>>>>>>>> design for the
>>>>>>>>>>>>>>>> entire system, storage and compute.
>>>>>>>>>>>>>>>> - deleting single records per snapshot is unfeasible in
>>>>>>>>>>>>>>>> eager but also particularly in the lazy design: each deletion 
>>>>>>>>>>>>>>>> creates a
>>>>>>>>>>>>>>>> very small snapshot. Deleting 1 million records one at a time 
>>>>>>>>>>>>>>>> would create
>>>>>>>>>>>>>>>> 1 million small files, and 1 million RPC calls.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why is this unfeasible? If I have a dataset of 100mm files
>>>>>>>>>>>>>>> including 1mm small files, is that a major problem? It seems 
>>>>>>>>>>>>>>> like your
>>>>>>>>>>>>>>> usecase isn't one where you want to support single record 
>>>>>>>>>>>>>>> deletes but it is
>>>>>>>>>>>>>>> definitely something important to many people.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 100 mm total files or 1 mm files per dataset is definitely a
>>>>>>>>>>>>>>> problem on HDFS, and I believe on S3 too. Single key delete 
>>>>>>>>>>>>>>> would work just
>>>>>>>>>>>>>>> fine, but it's simply not optimal to do that on remote storage. 
>>>>>>>>>>>>>>> This is a
>>>>>>>>>>>>>>> very well known problem with HDFS, and one of the very reasons 
>>>>>>>>>>>>>>> to have
>>>>>>>>>>>>>>> something like Iceberg in the first place.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically the users would be able to do single key mutation,
>>>>>>>>>>>>>>> but it's not the use case we should be optimizing for, but it's 
>>>>>>>>>>>>>>> really not
>>>>>>>>>>>>>>> advisable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Eager is conceptually just lazy + compaction done, well,
>>>>>>>>>>>>>>>> eagerly. The logic for both is exactly the same, the trade-off 
>>>>>>>>>>>>>>>> is just that
>>>>>>>>>>>>>>>> with eager you implicitly compact every time so that you don't 
>>>>>>>>>>>>>>>> do any work
>>>>>>>>>>>>>>>> on read, while with lazy
>>>>>>>>>>>>>>>> you want to amortize the cost of compaction over multiple
>>>>>>>>>>>>>>>> snapshots.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Basically there should be no difference between the two
>>>>>>>>>>>>>>>> conceptually, or with regard to keys, etc. The only difference 
>>>>>>>>>>>>>>>> is some
>>>>>>>>>>>>>>>> mechanics in implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think you have deconstruct the problem too much to say
>>>>>>>>>>>>>>> these are the same (or at least that is what I'm starting to 
>>>>>>>>>>>>>>> think given
>>>>>>>>>>>>>>> this thread). It seems like real world implementation decisions 
>>>>>>>>>>>>>>> (per our
>>>>>>>>>>>>>>> discussion here) are in conflict. For example, you just argued 
>>>>>>>>>>>>>>> against
>>>>>>>>>>>>>>> having a 1mm arbitrary mutations but I think that is because 
>>>>>>>>>>>>>>> you aren't
>>>>>>>>>>>>>>> thinking about things over time with a delta implementation. 
>>>>>>>>>>>>>>> Having 10,000
>>>>>>>>>>>>>>> mutations a day where we do delta compaction once a week
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and local file mappings (key to offset sparse bitmaps) seems
>>>>>>>>>>>>>>> like it could result in very good performance in a case where 
>>>>>>>>>>>>>>> we're
>>>>>>>>>>>>>>> mutating small amounts of data. In this scenario, you may not 
>>>>>>>>>>>>>>> do major
>>>>>>>>>>>>>>> compaction ever unless you get to a high enough percentage of 
>>>>>>>>>>>>>>> records that
>>>>>>>>>>>>>>> have been deleted in the original dataset. That drives a very 
>>>>>>>>>>>>>>> different set
>>>>>>>>>>>>>>> of implementation decisions from a situation where you're 
>>>>>>>>>>>>>>> trying to restate
>>>>>>>>>>>>>>> an entire partition at once.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We operate on 1 billion mutations per day at least. This is
>>>>>>>>>>>>>>> the problem Iceberg wants to solve, I believe it's stated 
>>>>>>>>>>>>>>> upfront.
>>>>>>>>>>>>>>> 10000/day is not a big data problem. It can be done fairly 
>>>>>>>>>>>>>>> trivially and it
>>>>>>>>>>>>>>> would be supported, but there's not much point in extra 
>>>>>>>>>>>>>>> optimizing for this
>>>>>>>>>>>>>>> use case I believe.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
