Thanks for putting this forward. Another term for the "lazy" approach would be "merge on read".
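
To make the terminology concrete, here is a minimal sketch of what I mean by merge-on-read, next to the eager alternative. It is only an illustration under simplifying assumptions: rows are plain dicts keyed by an `id` column, and the `Diff` record and both function names are made up for this example. None of this is an Iceberg, Hudi, or Hive API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Diff:
    """Hypothetical diff record: an upsert carries a row, a delete carries None."""
    key: int
    row: Optional[dict]


def merge_on_read(base: Iterable[dict], diffs: Iterable[Diff]) -> List[dict]:
    """Lazy approach: the base data is left untouched and diffs are applied at read time."""
    latest: Dict[int, Optional[dict]] = {}
    for diff in diffs:
        # Diffs are assumed to arrive in commit order, so later ones win.
        latest[diff.key] = diff.row

    # Keep base rows that no diff touched, in their original order.
    merged = [row for row in base if row["id"] not in latest]
    # Append surviving upserts; keys whose latest diff is a delete are dropped.
    merged.extend(row for row in latest.values() if row is not None)
    return merged


def eager_rewrite(base: Iterable[dict], diffs: Iterable[Diff]) -> List[dict]:
    """Eager approach: pay the merge cost at write time and publish a new base.

    Readers then scan the returned rows directly, with no merge step.
    """
    return merge_on_read(base, diffs)


if __name__ == "__main__":
    base_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
    pending = [
        Diff(2, {"id": 2, "v": "B"}),  # update
        Diff(3, None),                 # delete
        Diff(4, {"id": 4, "v": "d"}),  # insert
    ]
    print(merge_on_read(base_rows, pending))
```

The two functions return the same rows; the difference is only in who pays for the merge and when. With the lazy variant every reader (Presto included) has to know how to apply diffs, while the eager variant keeps readers simple at the cost of rewriting data on each change.
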

My team has built something internally that uses merge-on-read but relies on an "eager" materialization for publication to Presto. Roughly, we maintain a table metadata file that looks a bit like Iceberg's and tracks the "live" version of each partition as it is updated over time. We are looking into a solution that would let us push the merge-on-read all the way to Presto (and other consumers), and adding merge-on-read to Iceberg is one of the approaches we are considering.

It's worth noting that Hudi also supports upserts/deletes, so that's another model to consider.

On Fri, May 10, 2019 at 8:30 AM Miguel Miranda <miguelnmira...@apple.com.invalid> wrote:

> Hi,
>
> As Anton said, we purposely avoided making a "decision" on which approach
> should be implemented in order to allow for a meaningful discussion with
> the community.
>
> The document starts with an eager approach as it is straightforward and
> easy to understand: its steps resemble the usual file-level
> operations/manipulations frequently used by engineers when implementing
> Update/Delete/Upsert behaviour themselves, hopefully creating a conceptual
> bridge to the more involved designs. Right now, Iceberg has almost
> everything needed to implement the "eager" approach, as we simply need to
> adjust the retry mechanism. For example, I have implemented a prototype of
> the eager solution with Spark and Iceberg.
>
> We looked into many existing solutions for inspiration, but when there
> isn't a paper or code in the public domain it becomes hard to assess the
> underlying design, although some of it can be inferred from the API or
> documentation.
>
> Best,
> Miguel
>
> On 10 May 2019, at 11:57, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
> Thanks for the feedback, Jacques!
>
> You are correct, we kept the question of the best approach open :) The
> idea was to have a discussion in the community. Hopefully, we can reach a
> consensus.
>
> While the proposed “lazy” approaches certainly offer significant benefits,
> they require more changes in Iceberg as well as in readers/query engines
> (depending on how we want to merge base and diff files). For us, it is
> important to understand whether the Iceberg community would even consider
> such changes.
>
> Hive ACID 3 is one of the projects we looked at. In fact, we spoke to Owen,
> the original creator of updates/deletes/upserts in Hive. I believe the
> “lazy” approaches are close to what Hive 3 does but with their own
> distinctions that Iceberg allows us to have. It would be great to have
> Owen’s feedback.
>
> We don’t know the internals of Delta as updates/deletes/upserts are not
> open source. My personal guess is that, yes, it might be similar to the
> “eager” approach in our doc.
>
> Jacques, could you share some insights into how you implement the merge of
> diffs? Is it done by readers?
>
> Thanks,
> Anton
>
> On 10 May 2019, at 06:24, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> This is a nice doc and it covers many different options. Upon first skim,
> I don't see a strong argument for a particular approach.
>
> In our own development, we've been leaning heavily towards what you
> describe in the document as "lazy with SRI". I believe this is consistent
> with what the Hive community did on top of ORC. It's interesting because my
> (maybe incorrect) understanding of the Databricks Delta approach is that
> they chose what you title "eager" in their approach to upserts. They may
> also have a lazy approach for other types of mutations but I don't think
> they do.
>
> Thanks again for putting this together!
> Jacques
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
>> Hi folks,
>>
>> Miguel (cc) and I have spent some time thinking about how to perform
>> updates/deletes/upserts on top of Iceberg tables. This functionality is
>> essential for many modern use cases. We've summarized our ideas in a doc
>> [1], which, hopefully, will trigger a discussion in the community. The
>> document presents different conceptual approaches alongside their
>> trade-offs. We will be glad to consider any other ideas as well.
>>
>> Thanks,
>> Anton
>>
>> [1] -
>> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/