* lineage state JSON structure

On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
> Hi Benny,
>
> Your understanding is correct.
>
> Another point that we discussed was the type of APIs engines can use to
> conveniently update the storage table with view query results as well as
> set the snapshot summary on the output snapshot (one that was produced by
> the update). We will follow up on that separately.
>
> Jan, do you want to reflect the lineage + state discussion in the doc
> so we can iterate on the lineage JSON structure?
>
> Thanks,
> Walaa.
>
> On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <btc...@gmail.com> wrote:
>
>> I really enjoyed listening to the replay and hearing everyone's
>> feedback! I'm in agreement with all 3 consensus items, especially around
>> Dan's idea to separate the view's query tree lineage vs
>> materialization's lineage state.
>>
>> I'll summarize my understanding about the distinction and add a few
>> comments:
>>
>> Materialized View's Query Tree Lineage
>> - It's basically the SQL representation converted to a distinct list of
>>   tables and views.
>> - Stored inside view versions so if you change the view SQL, you can
>>   include the lineage with that change.
>> - Tables support time travel so they can optionally include a ref type
>>   and name/timestamp
>> - Views would NOT include the version (that's part of the materialization
>>   lineage state below)
>> - I think we should use fully qualified identifiers here instead of
>>   UUIDs. Dropping and re-creating a referenced table or view doesn't break
>>   the view SQL so the lineage should not be broken either. I also don't
>>   think we can support time travel if we used table UUIDs here.
>> - Each table or view can be assigned a unique sequence number. This
>>   sequence number is scoped to a single view version.
>>
>> Materialization Lineage State
>> - It's basically a lookup table for the above sequence number to either a
>>   table snapshot id or view version that was used at the time of
>>   creating/refreshing the storage table.
>>   For views, these are nested views within the MV's query tree - not
>>   the MV itself.
>> - Stored inside the table's snapshot summary
>> - Additional property "refresh-version-id" to identify the MV's version.
>>
>> In order to validate the freshness of a materialization, everything above
>> has to be checked against the latest tables and views. This should cover
>> all data and query tree changes (that I can think of) such as the "limit
>> 100" example I gave in Slack
>> <https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>.
>>
>> Please let me know your thoughts.
>>
>> Thanks
>>
>> On Thu, Jun 6, 2024 at 7:53 AM <russell.spit...@gmail.com> wrote:
>>
>>> Thanks for hosting; it was a very helpful meeting. I really hope we can
>>> do more in the future to accelerate consensus on other proposals.
>>>
>>> I do encourage anyone on the mailing list to add your comments offline
>>> as well, especially if you have strong feelings. Iceberg is an open project
>>> and we realize not everyone can attend virtual meetings and want you to
>>> know you are welcome.
>>>
>>> On Jun 6, 2024, at 7:11 AM, Jan Kaul <jank...@mailbox.org.invalid>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> thanks to all of you who attended the meeting yesterday! It was great to
>>> talk to you and I think we made great progress. For those of you who
>>> weren't able to attend the meeting, I summarized the main points below:
>>>
>>> *Question 1*: Should we store the "storage table pointer" as a view
>>> property or as an additional field in the view metadata?
>>>
>>> We reached consensus to add a *new metadata field* "storage-table" to
>>> the view version <https://iceberg.apache.org/view-spec/#versions>
>>> record that stores the identifier of the storage table. The motivation
>>> for introducing a new field is that this emphasizes that materialized views
>>> are part of the standard and it enforces a common behavior.
>>>
>>> *Question 2*: Where should the lineage-state information be stored?
>>>
>>> We reached consensus on storing the lineage-state information in the
>>> *snapshot summary* of the storage table. The motivation behind this is
>>> that the table spec should not be concerned with defining view constructs.
>>>
>>> *Question 3*: How should the lineage-state information be represented?
>>>
>>> We reached consensus on representing the lineage-state in the form of
>>> nested objects and storing these as a *JSON-encoded string* inside the
>>> storage table snapshot summary.
>>>
>>> Additionally, Dan proposed to introduce a new lineage construct as part
>>> of the view definition in addition to the lineage-state that is part of the
>>> storage table. The idea is to separate the concerns. The lineage-state in
>>> the storage table should only capture the state of the source tables at the
>>> time of the last refresh, whereas the lineage information in the view
>>> contains more information about the source tables and is responsible for
>>> resolving the identifiers. We haven't really decided on how the new lineage
>>> construct should be represented or integrated into the view metadata.
>>>
>>> One point that we didn't really have the time to discuss was Benny's
>>> comment about also storing the version-id of views in the case that the
>>> materialized view is referencing a view. I think we should also integrate
>>> that into the spec.
>>>
>>> You can find the recording of the meeting here:
>>>
>>> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>>>
>>> Best wishes,
>>>
>>> Jan
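
Illustrative note: the split discussed in this thread (lineage in the view version, lineage state as a JSON-encoded string in the storage table's snapshot summary) might look roughly like the Python sketch below. Only "refresh-version-id" and the idea of per-view-version sequence numbers come from the thread. The "lineage-state" summary key, the field names inside `lineage`, the identifiers, and both helper functions are hypothetical, since the thread notes the JSON structure is still undecided.

```python
import json

# Hypothetical lineage kept in the view version: fully qualified identifiers,
# each with a sequence number scoped to this single view version.
lineage = [
    {"sequence-number": 0, "type": "table", "identifier": "prod.db.orders"},
    {"sequence-number": 1, "type": "view", "identifier": "prod.db.active_customers"},
]


def encode_lineage_state(refresh_version_id, state):
    """Build snapshot-summary entries (string -> string) for one refresh.

    `state` maps a lineage sequence number to the table snapshot id or the
    nested view version id that was read during the refresh.
    """
    return {
        "refresh-version-id": str(refresh_version_id),
        # "lineage-state" is an assumed key name, not one agreed in the thread.
        "lineage-state": json.dumps(
            {str(seq): value for seq, value in state.items()}, sort_keys=True
        ),
    }


def is_fresh(summary, current_version_id, current_state):
    """Return True if the recorded state matches the current sources."""
    if summary.get("refresh-version-id") != str(current_version_id):
        return False  # the MV's own version changed since the refresh
    recorded = json.loads(summary["lineage-state"])
    current = {str(seq): value for seq, value in current_state.items()}
    return recorded == current


# Sequence number 0 -> table snapshot id, 1 -> nested view version id.
summary = encode_lineage_state(3, {0: 101, 1: 2})
```

Under these assumptions, a freshness check compares the MV's current version and the latest snapshot/version of every lineage entry against what the summary recorded, e.g. `is_fresh(summary, 3, {0: 102, 1: 2})` reports a stale materialization because the table at sequence number 0 has a newer snapshot.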