Thanks for putting this forward.

Another term for the "lazy" approach would be "merge on read".

My team has built something in-house that uses merge-on-read internally
but uses an "eager" materialization for publication to Presto. Roughly, we
maintain a table metadata file that looks a bit like Iceberg's and tracks
the "live" version of each partition as it is updated over time. We are
looking into a solution that will allow us to push the merge-on-read all
the way to Presto (and other consumers), and adding Merge-On-Read to
Iceberg is one of the approaches we are considering.
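
As a rough illustration of the difference between the two modes (a sketch
only, not the internal system described above): the function names, the
use of None as a delete marker, and the integer version numbers are all
assumptions made up for this example. A reader doing merge-on-read applies
the diff to the base rows at query time, while eager publication writes a
fully merged copy and repoints the "live" version in the table metadata.

```python
# Hypothetical sketch: rows are keyed by primary key; a delta maps a key to
# a new row (upsert) or to None (delete marker). None of this is Iceberg API.

def merge_on_read(base, delta):
    """Lazy: apply the diff to the base rows while reading."""
    merged = dict(base)  # leave the base version untouched
    for key, row in delta.items():
        if row is None:
            merged.pop(key, None)  # delete; ignore keys absent from base
        else:
            merged[key] = row      # insert or update
    return merged


def publish_eagerly(metadata, versions, partition, delta):
    """Eager: write a merged copy and repoint the partition's live version."""
    live = metadata[partition]
    new_version = live + 1
    versions[(partition, new_version)] = merge_on_read(
        versions[(partition, live)], delta
    )
    metadata[partition] = new_version  # consumers now read the merged copy
    return new_version


if __name__ == "__main__":
    versions = {("2019-05-10", 1): {1: "a", 2: "b"}}
    metadata = {"2019-05-10": 1}      # partition -> live version
    delta = {2: None, 3: "c"}         # delete key 2, insert key 3

    print(merge_on_read(versions[("2019-05-10", 1)], delta))
    publish_eagerly(metadata, versions, "2019-05-10", delta)
    print(metadata)
```

In the eager path the consumer never sees the delta; pushing merge-on-read
to the consumer means the consumer itself has to run something like
merge_on_read.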

It's worth noting that Hudi does have support for upserts/deletes as well,
so that's another model to consider.

On Fri, May 10, 2019 at 8:30 AM Miguel Miranda
<miguelnmira...@apple.com.invalid> wrote:

> Hi,
>
> As Anton said, we purposely avoided making a "decision" on which approach
> should be implemented in order to allow for a meaningful discussion with
> the community.
>
> The document starts with an eager approach as it is straightforward and
> easy to understand: steps resemble the usual file level
> operations/manipulations frequently used by engineers when implementing
> Update/Delete/Upsert behaviour themselves, hopefully creating a conceptual
> bridge to the more involved designs. Right now, Iceberg has almost
> everything to implement the "eager" approach as we simply need to adjust
> the retry mechanism. For example, I have implemented a prototype of the
> eager solution with Spark and Iceberg.
>
> We looked into many existing solutions for inspiration, but when there
> isn't a paper or code in the public domain it becomes hard to assess the
> underlying design, although some of it can be inferred from the API or
> documentation.
>
> Best,
> Miguel
>
> On 10 May 2019, at 11:57, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
> Thanks for the feedback, Jacques!
>
> You are correct, we kept the question of the best approach open :) The
> idea was to have a discussion in the community. Hopefully, we can reach a
> consensus.
>
> While the proposed “lazy” approaches certainly offer significant benefits,
> they require more changes in Iceberg as well as in readers/query engines
> (depending on how we want to merge base and diff files). For us, it is
> important to understand whether the Iceberg community would even consider
> such changes.
>
> Hive ACID 3 is one of the projects we looked at. In fact, we spoke to Owen,
> the original creator of updates/deletes/upserts in Hive. I believe the
> “lazy” approaches are close to what Hive 3 does but with their own
> distinctions that Iceberg allows us to have. It would be great to have
> Owen’s feedback.
>
> We don’t know the internals of Delta as updates/deletes/upserts are not
> open source. My personal guess, yes, it might be similar to the “eager”
> approach in our doc.
>
> Jacques, could you share some insights into how you implement the merge of
> diffs? Is it done by readers?
>
> Thanks,
> Anton
>
> On 10 May 2019, at 06:24, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> This is a nice doc and it covers many different options. Upon first skim,
> I don't see a strong argument for any particular approach.
>
> In our own development, we've been leaning heavily towards what you
> describe in the document as "lazy with SRI". I believe this is consistent
> with what the Hive community did on top of Orc. It's interesting because my
> (maybe incorrect) understanding of the Databricks Delta approach is they
> chose what you title "eager" in their approach to upserts. They may also
> have a lazy approach for other types of mutations but I don't think they do.
>
> Thanks again for putting this together!
> Jacques
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi <
> aokolnyc...@apple.com.invalid> wrote:
>
>> Hi folks,
>>
>> Miguel (cc) and I have spent some time thinking about how to perform
>> updates/deletes/upserts on top of Iceberg tables. This functionality is
>> essential for many modern use cases. We've summarized our ideas in a doc
>> [1], which, hopefully, will trigger a discussion in the community. The
>> document presents different conceptual approaches alongside their
>> trade-offs. We will be glad to consider any other ideas as well.
>>
>> Thanks,
>> Anton
>>
>> [1] -
>> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/
>>
>>
>>
>
>
