Re: Summary of Iceberg Materialized View Meeting

Jan Kaul Thu, 20 Jun 2024 21:44:28 -0700

Great idea. Let's try to resolve all issues in the doc by June 30th, and if there are still open points we will have another meeting.


Best wishes


Jan

On 20.06.24 19:10, Walaa Eldin Moustafa wrote:

Jan, I think there are a couple of open questions on the doc. Let us discuss them on the doc for a week, then meet again if they are still open?


Thanks,
Walaa.

On Fri, Jun 7, 2024 at 12:27 PM Jan Kaul <jank...@mailbox.org.invalid> wrote:


    No that's great, thank you. I'm thankful for the input.

    Jan

    On 07.06.2024 17:53, Benny Chow <btc...@gmail.com> wrote:

        Looks good, Jan.  I'm a bit nitpicky about picking good names, so I
        left some comments around that to see what others think.

        Thanks

        On Fri, Jun 7, 2024 at 2:26 AM Jan Kaul
        <jank...@mailbox.org.invalid> wrote:

            Thanks Benny and Walaa for your input. I updated the doc
            
<https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
            to account for the changes as far as I understood them. I would
            appreciate it if you could have a look and give me some feedback.

            If you have some open comments that are not relevant
            anymore due to the changes, please close them so that we
            can clean up the comments section a bit.

            Regards,

            Jan

            On 07.06.24 08:33, Walaa Eldin Moustafa wrote:

                * lineage state JSON structure

                On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa
                <wa.moust...@gmail.com> wrote:

                    Hi Benny,

                    Your understanding is correct.

                    Another point that we discussed was the type of
                    APIs engines can use to conveniently update the
                    storage table with view query results as well as
                    set the snapshot summary on the output snapshot
                    (one that was produced by the update). We will
                    follow up on that separately.

                    Jan, do you want to reflect the lineage + state
                    discussion in the doc so we can iterate on the
                    lineage JSON structure?

                    Thanks,
                    Walaa.


                    On Thu, Jun 6, 2024 at 9:40 PM Benny Chow
                    <btc...@gmail.com> wrote:

                        I really enjoyed listening to the replay and
                        hearing everyone's feedback!  I'm in agreement
                        with all 3 consensus items, especially around
                        Dan's idea to separate the view's query tree
                        lineage from the materialization's lineage state.

                        I'll summarize my understanding about the
                        distinction and add a few comments:

                        Materialized View's Query Tree Lineage
                        - It's basically the SQL representation
                        converted to a distinct list of tables and views.
                        - Stored inside view versions so if you change
                        the view SQL, you can include the lineage with
                        that change.
                        - Tables support time travel so they can
                        optionally include a ref type and name/timestamp
                        - Views would NOT include the version (that's
                        part of the materialization lineage state below)
                        - I think we should use fully qualified
                        identifiers here instead of UUIDs.  Dropping
                        and re-creating a referenced table or view
                        doesn't break the view SQL so the lineage
                        should not be broken either.  I also don't
                        think we can support time travel if we used
                        table UUIDs here.
                        - Each table or view can be assigned a unique
                        sequence number. This sequence number is
                        scoped to a single view version.

                        Materialization Lineage State
                        - It's basically a lookup table for the above
                        sequence number to either a table snapshot id
                        or view version that was used at the time of
                        creating/refreshing the storage table.  For
                        views, these are nested views within the MV's
                        query tree - not the MV itself.
                        - Stored inside the table's snapshot summary
                        - Additional property "refresh-version-id" to
                        identify the MV's version.
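
A minimal sketch of how the two structures described above could relate, assuming a flat list of lineage entries and a plain dictionary for the lineage state. The class name, field names, identifiers, and ids are hypothetical illustrations, not part of any agreed spec; only "refresh-version-id" is taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageEntry:
    """One table or view referenced by the MV's query tree (hypothetical shape)."""
    sequence_number: int       # unique within a single view version
    identifier: str            # fully qualified identifier, not a UUID
    kind: str                  # "table" or "view"
    ref: Optional[str] = None  # optional branch/tag name, tables only


# Query tree lineage: stored with the view version.
lineage: List[LineageEntry] = [
    LineageEntry(0, "catalog.db.orders", "table", ref="main"),
    LineageEntry(1, "catalog.db.recent_customers", "view"),
]

# Materialization lineage state: stored in the storage table's snapshot summary.
# Maps sequence number -> table snapshot id or nested view version id.
lineage_state: Dict[int, int] = {0: 8744736658442914487, 1: 3}

# Additional property identifying the MV version used for the refresh.
summary_properties: Dict[str, str] = {"refresh-version-id": "2"}


def resolve(entries: List[LineageEntry], state: Dict[int, int]) -> Dict[str, int]:
    """Join lineage and lineage state on the sequence number."""
    return {e.identifier: state[e.sequence_number] for e in entries}


resolved = resolve(lineage, lineage_state)
```

Because the state is keyed by sequence number rather than by identifier, it only makes sense together with the view version named by "refresh-version-id".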

                        In order to validate the freshness of a
                        materialization, everything above has to be
                        checked against the latest tables and views. 
                        This should cover all data and query tree
                        changes (that I can think of) such as the
                        "limit 100" example I gave in Slack
                        
<https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>.
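
The freshness check described above could be sketched roughly as follows. This is an illustration under assumptions, not an agreed algorithm: the function name, the dictionary shapes, and the `current_state` lookup (identifier to latest snapshot id or view version) are hypothetical, and time-travel refs are ignored for brevity.

```python
from typing import Dict


def is_fresh(
    current_mv_version: int,
    summary: Dict[str, str],        # storage table snapshot summary
    lineage: Dict[int, str],        # sequence number -> fully qualified identifier
    lineage_state: Dict[int, int],  # sequence number -> snapshot id / view version
    current_state: Dict[str, int],  # identifier -> latest snapshot id / view version
) -> bool:
    # 1. The MV definition itself must be the one used for the last refresh.
    if summary.get("refresh-version-id") != str(current_mv_version):
        return False
    # 2. Every referenced table or nested view must be unchanged since then.
    for seq, identifier in lineage.items():
        if seq not in lineage_state or identifier not in current_state:
            return False  # unknown or dropped source: treat as stale
        if lineage_state[seq] != current_state[identifier]:
            return False
    return True


lineage = {0: "catalog.db.orders", 1: "catalog.db.recent_customers"}
lineage_state = {0: 101, 1: 3}
summary = {"refresh-version-id": "2"}
current = {"catalog.db.orders": 101, "catalog.db.recent_customers": 3}

fresh = is_fresh(2, summary, lineage, lineage_state, current)
stale_data = is_fresh(2, summary, lineage, lineage_state, {**current, "catalog.db.orders": 102})
stale_query = is_fresh(3, summary, lineage, lineage_state, current)
```

In this sketch a new snapshot on a source table and a new version of the MV itself both make the materialization stale, which is the intent of checking both the lineage state and "refresh-version-id".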

                        Please let me know your thoughts.

                        Thanks

                        On Thu, Jun 6, 2024 at 7:53 AM
                        <russell.spit...@gmail.com> wrote:

                            Thanks for hosting, it was a very helpful
                            meeting. I really hope we can do more in
                            the future to accelerate consensus on
                            other proposals.


                            I do encourage anyone on the mailing list
                            to add your comments offline as well,
                            especially if you have strong feelings.
                            Iceberg is an open project and we realize
                            not everyone can attend virtual meetings
                            and want you to know you are welcome.



                                On Jun 6, 2024, at 7:11 AM, Jan Kaul
                                <jank...@mailbox.org.invalid> wrote:

                                

                                Hi all,

                                thanks to all of you who attended the
                                meeting yesterday! It was great to
                                talk to you and I think we made great
                                progress. For those of you who weren't
                                able to attend the meeting, I
                                summarized the main points below:

                                *Question 1*: Should we store the
                                "storage table pointer" as a view
                                property or as an additional field in the
                                view metadata?

                                We reached consensus to add a *new
                                metadata field* "storage-table" to the
                                view version
                                <https://iceberg.apache.org/view-spec/#versions>
                                record that stores the identifier of
                                the storage table. The motivation
                                for introducing a new field is that
                                this emphasizes that materialized
                                views are part of the standard and it
                                enforces a common behavior.

                                *Question 2*: Where should the
                                lineage-state information be stored?

                                We reached consensus on storing the
                                lineage-state information in the
                                *snapshot summary* of the storage
                                table. The motivation behind this is
                                that the table spec should not be
                                concerned with defining view constructs.

                                *Question 3*: How should the
                                lineage-state information be represented?

                                We reached consensus on representing
                                the lineage-state in the form of
                                nested objects and storing these as a
                                *JSON-encoded string* inside the
                                storage table snapshot summary.
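
Since the snapshot summary is a map of string keys to string values, nested lineage-state objects have to be serialized before they are stored. A minimal sketch of that round trip, assuming a hypothetical summary key "lineage-state" and a hypothetical nested layout (neither was fixed in the meeting):

```python
import json
from typing import Any, Dict


def encode_lineage_state(state: Dict[str, Any]) -> str:
    """Serialize nested lineage-state objects into a compact JSON string."""
    return json.dumps(state, sort_keys=True, separators=(",", ":"))


def decode_lineage_state(summary: Dict[str, str], key: str = "lineage-state") -> Dict[str, Any]:
    """Read the JSON-encoded lineage state back out of a snapshot summary."""
    return json.loads(summary[key])


# Hypothetical nested layout: one object per source table or nested view.
nested_state = {
    "0": {"snapshot-id": 8744736658442914487},
    "1": {"version-id": 3},
}

# The summary stays a flat string-to-string map.
snapshot_summary: Dict[str, str] = {
    "operation": "overwrite",
    "lineage-state": encode_lineage_state(nested_state),
}

round_tripped = decode_lineage_state(snapshot_summary)
```

Keeping the value as a single string means the table spec does not need to know anything about the structure inside it.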

                                Additionally, Dan proposed to
                                introduce a new lineage construct as
                                part of the view definition in
                                addition to the lineage-state that is
                                part of the storage table. The idea is
                                to separate the concerns. The
                                lineage-state in the storage table
                                should only capture the state of the
                                source tables at the time of the last
                                refresh, whereas the lineage
                                information in the view contains more
                                information about the source tables
                                and is responsible for resolving the
                                identifiers. We haven't really decided
                                on how the new lineage construct
                                should be represented or integrated
                                into the view metadata.

                                One point that we didn't really have
                                the time to discuss was Benny's
                                comment about also storing the version-id
                                of views in the case that the
                                materialized view is referencing a
                                view. I think we should also integrate
                                that into the spec.

                                You can find the recording of the
                                meeting here:

                                
https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
                                

                                Best wishes,

                                Jan
