I would be happy to participate. Iceberg with merge-on-read capabilities is
a technology choice that my team is actively considering. It appears that
our scenario differs meaningfully from the one that Anton and Miguel are
considering. It would be great to take the time to compare the two and see
if there is a single implementation that can meet the needs of each
scenario.

On Wed, May 15, 2019 at 3:55 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for working on this, Anton and Miguel!
>
> Would anyone be interested in scheduling a hangout to talk about next
> steps and tentative design choices?
>
> The doc is a great start and does a good job laying out the trade-offs
> between different approaches. I appreciate the idea to get a discussion
> started and not to pick one particular approach, but I think that it does
> make a few choices clear:
>
> *1. Iceberg should support lazy (read-side) merging using diff files*
>
> The eager approach doesn’t require much beyond Iceberg’s existing support.
> Adding diff files is the next step for engines that need to implement lazy
> merging for merge/upsert/delete. I support adding these structures to the
> spec (as a new format version).
>
> *2. Iceberg diff files should use synthetic keys*
>
> A lot of the discussion on the doc is about whether natural keys are
> practical or what assumptions we can make or trade about them. In my
> opinion, Iceberg tables will absolutely need natural keys for reasonable
> use cases. And those natural keys will need to be unique. And Iceberg will
> need to rely on engines to enforce that uniqueness.
>
> But, there is a difference between table behavior and implementation. We
> can use synthetic keys to implement the requirements of natural keys. Each
> row should be identified by its file and position in a file. When deleting
> by a natural key, we just need to find out what the synthetic key is and
> encode that in the delete diff.
>
This comment has important implications for the effort required to generate
delete diff files. I've tried to cover why in comments I added today to the
doc, but it could also be a topic of the hangout.
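To make the effort concrete, here is a minimal, purely illustrative sketch (the function and column names are hypothetical, not Iceberg's actual API): deleting by a natural key under a (file, position) synthetic-key scheme means first scanning the data files to locate each matching row before the delete diff can be written.

```python
# Hypothetical sketch: translating a natural-key delete into position deletes.
# Structures and names are illustrative; they are not Iceberg's real API.

def natural_key_deletes_to_position_deletes(data_files, delete_keys):
    """Scan data files to find the (filename, position) of each row whose
    natural key is in delete_keys, producing position-delete records."""
    position_deletes = []
    for filename, rows in data_files.items():
        for position, row in enumerate(rows):
            if row["id"] in delete_keys:  # "id" stands in for the natural key
                position_deletes.append((filename, position))
    return position_deletes

# Example: two data files; delete the rows with natural keys 2 and 5.
data_files = {
    "part-00000.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "part-00001.parquet": [{"id": 4}, {"id": 5}],
}
deletes = natural_key_deletes_to_position_deletes(data_files, {2, 5})
# deletes == [("part-00000.parquet", 1), ("part-00001.parquet", 1)]
```

The scan is the cost I'm referring to: every natural-key delete implies reading (or indexing) the data files that might contain the keys.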


> With the physical representation using synthetic keys, we should also
> define how to communicate a natural key constraint for a table. That way,
> writers can fail if a write may violate the key constraints of a table.
>
> *3. Synthetic keys should be based on filename and position*
>
> I think identifying the file in a synthetic key makes a lot of sense. This
> would allow for delta file reuse as individual files are rewritten by a
> “major” compaction and provides nice flexibility that fits with the format.
> We will need to think through all the impacts, like how file relocation
> works (e.g., move between regions) and the requirements for rewrites (must
> apply the delta when rewriting).
>
I'm confused. I feel like specifying the filename has the opposite effect.
One of the biggest advantages of Iceberg is the decoupling of a dataset
from the physical location of its constituent files. If a delta file
encodes the filename of the row that it updates/deletes, you are putting a
significant constraint on the way that an implementation can manipulate
those files later.
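To illustrate the coupling (again with hypothetical names, assuming a position-delete representation): any rewrite of a data file referenced by name must either apply the pending deltas as part of the rewrite, or rewrite the deltas themselves against the new file.

```python
# Hypothetical sketch of the constraint: rewriting a data file that position
# deletes reference by name forces the rewrite to apply those deletes (after
# which they can be discarded) or to remap them to the new file.

def compact_with_position_deletes(filename, rows, position_deletes):
    """Rewrite a data file, applying any position deletes that reference it
    by name. Returns the surviving rows."""
    deleted_positions = {pos for f, pos in position_deletes if f == filename}
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]

rows = ["a", "b", "c", "d"]
deletes = [("part-00000.parquet", 1), ("part-00000.parquet", 3)]
survivors = compact_with_position_deletes("part-00000.parquet", rows, deletes)
# survivors == ["a", "c"]
```

In other words, once deltas name files, the file is no longer freely movable/rewritable metadata: every operation that changes a filename has to reconcile the deltas too.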


> *Open questions*
>
> There are also quite a few remaining questions for a design:
>
>    - Should Iceberg use insert diff files? (My initial answer is no)
>    - Should Iceberg require diff compaction? Iceberg could require one
>    delete diff per partition, for example. (My answer: no)
>    - Should data files store synthetic key position? If so, why?
>    - Should there be a dense format for deletes, or just a sparse format?
>    - What is the scope of a delete diff? At a minimum, partition. But
>    does it make sense to build ways to restrict scope further?
>
>
> On Fri, May 10, 2019 at 11:27 AM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
>
>> We did take a look at Hudi. The overall design seems to be pretty
>> complicated and, unfortunately, I didn’t have time to explore every detail.
>>
>> Here is my understanding (correct me if I am wrong):
>>
>> - Hudi has RECORD_KEY, which is expected to be unique.
>> - Hudi has PRECOMBINED_KEY, which is used to pick only one row in the
>> incoming batch if there are multiple rows with the same key. As I
>> understand, this isn't used on reads. It is used on writes to deduplicate
>> rows with identical keys within one incoming batch. For example, if we are
>> inserting 10 records and two rows have the same key, PRECOMBINED_KEY will
>> be used to pick up only one row.
>> - Once Hudi ensures the uniqueness of RECORD_KEY within the incoming
>> batch, it loads the Bloom filter index from all existing Parquet files in
>> the involved partitions (that is, the partitions touched by the input
>> batch) and tags each record as either an update or insert by mapping the
>> incoming
>> and tags each record as either an update or insert by mapping the incoming
>> keys to existing files for updates. At this point, it seems to rely on a
>> join.
>>
>> Is my understanding correct? If so, do we want to consider joins on
>> write? We mentioned this technique as one way to ensure the uniqueness of
>> natural keys but we were concerned about the performance. Also, does Hudi
>> support record-level updates?
>>
>> Thanks,
>> Anton
>>
>> On 10 May 2019, at 18:22, Erik Wright <erik.wri...@shopify.com.INVALID>
>> wrote:
>>
>> Thanks for putting this forward.
>>
>> Another term for the "lazy" approach would be "merge on read".
>>
>> My team has built something that uses merge-on-read internally but an
>> "Eager" materialization for publication to Presto. Roughly, we
>> maintain a table metadata file that looks a bit like Iceberg's and tracks
>> the "live" version of each partition as it is updated over time. We are
>> looking into a solution that will allow us to push the merge-on-read all
>> the way to Presto (and other consumers), and adding Merge-On-Read to
>> Iceberg is one of the approaches we are considering.
>>
>> It's worth noting that Hudi does have support for upserts/deletes as
>> well, so that's another model to consider.
>>
>> On Fri, May 10, 2019 at 8:30 AM Miguel Miranda <
>> miguelnmira...@apple.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> As Anton said, we purposely avoided making a "decision" on which
>>> approach should be implemented in order to allow for a meaningful
>>> discussion with the community.
>>>
>>> The document starts with an eager approach as it is straightforward and
>>> easy to understand: its steps resemble the usual file-level
>>> operations/manipulations frequently used by engineers when implementing
>>> Update/Delete/Upsert behaviour themselves, hopefully creating a conceptual
>>> bridge to the more involved designs. Right now, Iceberg has almost
>>> everything needed to implement the "eager" approach; we simply need to
>>> adjust the retry mechanism. In fact, I have implemented a prototype of the
>>> eager solution with Spark and Iceberg.
>>>
>>> We looked into many existing solutions for inspiration, but when there
>>> isn't a paper or code in the public domain it becomes hard to assess the
>>> underlying design, although some of it can be inferred from the API or
>>> documentation.
>>>
>>> Best,
>>> Miguel
>>>
>>> On 10 May 2019, at 11:57, Anton Okolnychyi <aokolnyc...@apple.com>
>>> wrote:
>>>
>>> Thanks for the feedback, Jacques!
>>>
>>> You are correct, we kept the question of the best approach as open :)
>>> The idea was to have a discussion in the community. Hopefully, we can reach
>>> a consensus.
>>>
>>> While the proposed “lazy” approaches certainly offer significant
>>> benefits, they require more changes in Iceberg as well as in readers/query
>>> engines (depending on how we want to merge base and diff files). For us, it
>>> is important to understand whether the Iceberg community would even
>>> consider such changes.
>>>
>>> Hive ACID 3 is one of the projects we looked at. In fact, we spoke to Owen,
>>> the original creator of updates/deletes/upserts in Hive. I believe the
>>> “lazy” approaches are close to what Hive 3 does but with their own
>>> distinctions that Iceberg allows us to have. It would be great to have
>>> Owen’s feedback.
>>>
>>> We don’t know the internals of Delta as updates/deletes/upserts are not
>>> open source. My personal guess is that, yes, it might be similar to the
>>> “eager” approach in our doc.
>>>
>>> Jacques, could you share some insights into how you implement the merge of
>>> diffs? Is it done by readers?
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 10 May 2019, at 06:24, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>
>>> This is a nice doc and it covers many different options. Upon first
>>> skim, I don't see a strong argument for a particular approach.
>>>
>>> In our own development, we've been leaning heavily towards what you
>>> describe in the document as "lazy with SRI". I believe this is consistent
>>> with what the Hive community did on top of Orc. It's interesting because my
>>> (maybe incorrect) understanding of the Databricks Delta approach is that
>>> they chose what you title "eager" for upserts. They may also have a lazy
>>> approach for other types of mutations, but I don't think they do.
>>>
>>> Thanks again for putting this together!
>>> Jacques
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi <
>>> aokolnyc...@apple.com.invalid> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> Miguel (cc) and I have spent some time thinking about how to perform
>>>> updates/deletes/upserts on top of Iceberg tables. This functionality is
>>>> essential for many modern use cases. We've summarized our ideas in a doc
>>>> [1], which, hopefully, will trigger a discussion in the community. The
>>>> document presents different conceptual approaches alongside their
>>>> trade-offs. We will be glad to consider any other ideas as well.
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>> [1] -
>>>> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
