Thanks Benny and Walaa for your input. I updated the doc <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing> to account for the changes as far as I understood them. I would appreciate it if you could have a look and give me some feedback.

If you have some open comments that are not relevant anymore due to the changes, please close them so that we can clean up the comments section a bit.

Regards,

Jan

On 07.06.24 08:33, Walaa Eldin Moustafa wrote:
* lineage state JSON structure

On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

    Hi Benny,

    Your understanding is correct.

    Another point that we discussed was the type of APIs engines can
    use to conveniently update the storage table with view query
    results as well as set the snapshot summary on the output snapshot
    (one that was produced by the update). We will follow up on that
    separately.

    Jan, do you want to reflect the lineage + state discussion in the
    doc so we can iterate on the lineage JSON structure?

    Thanks,
    Walaa.


    On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <btc...@gmail.com> wrote:

        I really enjoyed listening to the replay and hearing
        everyone's feedback!  I'm in agreement with all 3 consensus
        items, especially Dan's idea to separate the view's query
        tree lineage from the materialization's lineage state.

        I'll summarize my understanding about the distinction and add
        a few comments:

        Materialized View's Query Tree Lineage
        - It's basically the SQL representation converted to a
        distinct list of tables and views.
        - Stored inside view versions so if you change the view SQL,
        you can include the lineage with that change.
        - Tables support time travel so they can optionally include a
        ref type and name/timestamp
        - Views would NOT include the version (that's part of the
        materialization lineage state below)
        - I think we should use fully qualified identifiers here
        instead of UUIDs.  Dropping and re-creating a referenced table
        or view doesn't break the view SQL so the lineage should not
        be broken either.  I also don't think we can support time
        travel if we used table UUIDs here.
        - Each table or view can be assigned a unique sequence
        number.  This sequence number is scoped to a single view version.

        Materialization Lineage State
        - It's basically a lookup table for the above sequence number
        to either a table snapshot id or view version that was used at
        the time of creating/refreshing the storage table.  For views,
        these are nested views within the MV's query tree - not the MV
        itself.
        - Stored inside the table's snapshot summary
        - Additional property "refresh-version-id" to identify the
        MV's version.
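
The lookup-table idea above could be sketched roughly as follows. The dictionary layout, the keys `snapshot-id` and `version-id`, and the helper `resolve_state` are assumptions made for illustration; the thread has not settled on a concrete structure.

```python
# Hypothetical lineage state captured at refresh time: sequence number ->
# snapshot id (for tables) or version id (for nested views).
lineage_state = {
    0: {"snapshot-id": 8271638172638},
    1: {"snapshot-id": 5562819273645},
    2: {"version-id": 4},
}

# Hypothetical query tree lineage taken from the view version:
# sequence number -> fully qualified identifier.
lineage = {
    0: "catalog.db.orders",
    1: "catalog.db.customers",
    2: "catalog.db.recent_orders",
}


def resolve_state(lineage, lineage_state):
    """Join the two structures on the sequence number."""
    missing = sorted(set(lineage) - set(lineage_state))
    if missing:
        raise KeyError(f"no recorded state for sequence numbers {missing}")
    return {lineage[seq]: lineage_state[seq] for seq in lineage}


resolved = resolve_state(lineage, lineage_state)
```

Keeping only sequence numbers in the state keeps the snapshot summary small, since the identifiers already live in the view version.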

        In order to validate the freshness of a materialization,
        everything above has to be checked against the latest tables
        and views.  This should cover all data and query tree changes
        (that I can think of) such as the "limit 100" example I gave
        in Slack
        
<https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>.
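
A freshness check along the lines described above might look like the sketch below. The function `is_fresh`, its parameters, and the `state` key are hypothetical; only `refresh-version-id` is a name taken from this thread. The sketch assumes the caller has already looked up the latest snapshot id or view version for every identifier.

```python
def is_fresh(view_version_id, lineage, summary_state, current):
    """Return True if the materialization matches the latest tables and views.

    view_version_id: current version id of the materialized view
    lineage:         sequence number -> fully qualified identifier
    summary_state:   {"refresh-version-id": int, "state": {seq: id}}
    current:         identifier -> latest snapshot id or view version id
    """
    # The MV definition itself changed (for example, a "limit 100" was added).
    if summary_state["refresh-version-id"] != view_version_id:
        return False
    state = summary_state["state"]
    # The recorded state must cover exactly the entries in the lineage.
    if set(state) != set(lineage):
        return False
    # Every source must still be at the snapshot or version seen at refresh.
    return all(current.get(lineage[seq]) == state[seq] for seq in lineage)


lineage = {0: "catalog.db.orders", 1: "catalog.db.recent_orders"}
summary_state = {"refresh-version-id": 3, "state": {0: 1001, 1: 4}}
current = {"catalog.db.orders": 1001, "catalog.db.recent_orders": 4}

fresh = is_fresh(3, lineage, summary_state, current)
stale_after_write = is_fresh(
    3, lineage, summary_state, {**current, "catalog.db.orders": 1002}
)
stale_after_sql_change = is_fresh(4, lineage, summary_state, current)
```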

        Please let me know your thoughts.

        Thanks

        On Thu, Jun 6, 2024 at 7:53 AM <russell.spit...@gmail.com> wrote:

            Thanks for hosting; it was a very helpful meeting. I really
            hope we can do more in the future to accelerate consensus
            on other proposals.


            I do encourage anyone on the mailing list to add your
            comments offline as well, especially if you have strong
            feelings. Iceberg is an open project; we realize not
            everyone can attend virtual meetings, and we want you to
            know you are welcome.



            On Jun 6, 2024, at 7:11 AM, Jan Kaul
            <jank...@mailbox.org.invalid> wrote:

            

            Hi all,

            thanks to all of you who attended the meeting yesterday!
            It was great to talk to you and I think we made great
            progress. For those of you who weren't able to attend the
            meeting, I summarized the main points below:

            *Question 1*: Should we store the "storage table pointer"
            as a view property or as an additional field in the view
            metadata?

            We reached consensus to add a *new metadata field*
            "storage-table" to the view version
            <https://iceberg.apache.org/view-spec/#versions> record
            that stores the identifier of the storage table. The
            motivation for introducing a new field is that this
            emphasizes that materialized views are part of the
            standard and it enforces a common behavior.

            *Question 2*: Where should the lineage-state information
            be stored?

            We reached consensus on storing the lineage-state
            information in the *snapshot summary* of the storage
            table. The motivation behind this is that the table spec
            should not be concerned with defining view constructs.

            *Question 3*: How should the lineage-state information be
            represented?

            We reached consensus on representing the lineage-state in
            the form of nested objects and storing these as a
            *JSON-encoded string* inside the storage table snapshot
            summary.
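
To illustrate why a JSON-encoded string is needed: the snapshot summary is a map from strings to strings, so nested objects have to be serialized before they are stored. The sketch below is only an assumption of how that could look; the summary key `lineage-state` and the shape of the nested objects are hypothetical.

```python
import json

# Hypothetical nested lineage state: one object per source table.
lineage_state = {
    "catalog.db.orders": {"snapshot-id": 8271638172638},
    "catalog.db.customers": {"snapshot-id": 5562819273645},
}

# Snapshot summary values must be strings, so the nested objects are
# serialized into a single JSON string under one (hypothetical) key.
summary = {
    "operation": "overwrite",
    "lineage-state": json.dumps(lineage_state, sort_keys=True),
}

# A reader decodes the string to get the nested objects back.
decoded = json.loads(summary["lineage-state"])
```

Note that JSON object keys are always strings, so any integer keys would come back as strings after decoding.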

            Additionally, Dan proposed to introduce a new lineage
            construct as part of the view definition in addition to
            the lineage-state that is part of the storage table. The
            idea is to separate the concerns. The lineage-state in
            the storage table should only capture the state of the
            source tables at the time of the last refresh, whereas
            the lineage information in the view contains more
            information about the source tables and is responsible
            for resolving the identifiers. We haven't really decided
            on how the new lineage construct should be represented or
            integrated into the view metadata.

            One point that we didn't really have the time to discuss
            was Benny's comment about also storing the version-id of
            views in case the materialized view references a view. I
            think we should also integrate that into the spec.

            You can find the recording of the meeting here:

            
https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing

            Best wishes,

            Jan
