Re: Materialized view integration with REST spec

Jan Kaul Wed, 21 Feb 2024 12:14:06 -0800

Thanks Micah, I think the voting chips are great.

@Szehon, actually what I had in mind was not to have one thread per question but rather have smaller threads that can be resolved more easily. I have the fear that one thread for the current question would lead to a very long and unmanageable discussion.

I've added another row to the table where everyone could provide a summary of their reason for choosing a certain design. This way we could move some of the content from the comment threads to the main document.


On 21.02.24 19:58, Micah Kornfield wrote:


    Of course we also need threads that express our preferences
    (voting). I would suggest to keep these separate from discussions
    about single points so that they can be persisted in the document.

Not sure if it is helpful, but I added voting chips to Question 0, as maybe an easier way to keep track of votes. If it is helpful, I can add them in other places that still need a vote (I think one needs a paid Google Docs account to insert them).


Thanks,
Micah

On Wed, Feb 21, 2024 at 10:23 AM Szehon Ho <[email protected]> wrote:

Thanks Jan. +1 on having just one thread per question for
vote/preference. Where do you suggest we have it, on the
discussion question itself? It would be good to keep the existing
threads and move it there.

Also, I think it makes sense to create a Slack channel (for
quick questions and replies), and also to discuss unresolved
questions in next week's sync or a separate meeting.

On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul
<[email protected]> wrote:

Thank you Jack for driving the consensus for the MV spec and
thank you all for the discussion.

I really like the idea of incremental consensus because we
often lose sight in detailed discussions. As Jack mentioned,
the highest priority question currently is: *Should the
Iceberg MV be realized as a view + storage table or do we
define a new metadata format?*

To have one place for the discussion, I added another
Question
(https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)
to the Materialized View Spec Google document.

To improve the visibility of the arguments I would like to
propose a new process. It would be great if all relevant
information is stored in the document itself. Therefore I
would suggest to use the comment threads for smaller,
temporary discussions which can be resolved by adding the
points to the main document. Please close the threads if the
information was added to the document. Additionally, I gave
you all permissions to edit the documents, so you can add
missing points yourselves.

Of course we also need threads that express our preferences
(voting). I would suggest to keep these separate from
discussions about single points so that they can be persisted
in the document.

After a phase of collecting arguments for the different
designs I think it would make sense to have a video call for
a face-to-face discussion.

What do you think?

Best wishes,

Jan

On 20.02.24 21:32, Manish Malhotra wrote:

        Very excited for MV to be in Iceberg :)
        Keeping in the same doc. would be helpful, to have the trail.
        But also agreed, if there are too many directions/threads,
        then keep closing the old one, if there are no more questions.
        And put down the assumptions for the initial version to move
        forward.


        On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa
        <[email protected]> wrote:

            I would vote to keep a log in the doc with open
            questions, and keep the doc updated with open questions
            as they arise/get resolved.

            On Tue, Feb 20, 2024 at 11:37 AM Jack Ye
            <[email protected]> wrote:

                Thanks for the response from everyone!

                Before proceeding further, I see a few people
                referring back to the current design from Jan. I
                specifically raised this thread based on the
                information in the doc and a few latest discussions
                we had there. Because there are many threads in the
                doc, and each thread points further to other
                discussion threads in the same doc or other doc, it
                is now quite hard to follow and continue discussing
                all different topics there.

                I hope we can reach incremental consensus on the
                questions in the doc through the dev list, because it
                provides more visibility, and also a single thread
                instead of multiple threads going on at the same
                time. If we think this format is not effective, I
                propose that we create a new mv channel in Iceberg
                Slack workspace, and people interested can join and
                discuss all these points directly. What do we think?

                Best,
                Jack Ye



                On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho
                <[email protected]> wrote:

                    Hi,

                    Great to see more discussion on the MV spec. 
                    Actually, Jan's document "Iceberg Materialized
                    View Spec"
                    <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
                    has been organized, with a "Design Questions"
                    section to track these debates, and it would be
                    nice to centralize the debates there, as Micah
                    mentions.

                    For Dan's question, I think this debate was
                    tracked in "Design Question 3: Should the storage
                    table be registered in the catalog?". I think the
                    general idea there was to not expose it directly
                    via Catalog as it is then exposed to user
                    modification. If the engine wants to access
                    anything about the storage table (including audit
                    and storage), it is of course there via the
                    storage table pointer. I think Walaa's point is
                    also good, we could expose it as we expose
                    metadata tables, but I am still not sure whether
                    there are use cases of engine access not yet covered.

                    It is true that for Jack's initial question (Do
                    we really want to go with the MV = view + storage
                    table design approach for Iceberg MV?),
                    unfortunately we did not capture it as a "Design
                    Question" in Jan's doc, as it was an implicit
                    assumption of 'yes', because it is the choice of
                    Hive, Trino, and other engines, as others have
                    pointed out.

                    Jack's point about potential evolution of MV
                    (like to add partitioning) is an interesting one,
                    but definitely hard to grasp.  I think it makes
                    sense to add this as a separate Design Question
                    in the doc, and add the options.  This will allow
                    us to flesh out this alternative
                    option(s).  Maybe Micah's point about modifying
                    existing proposal to 'embed' the required table
                    metadata fields in the existing view metadata, is
                    one middle ground option.  Or we add a totally
                    new MV object spec, separate from the existing
                    View spec?

                    Also, as Jack pointed out, it may make sense to
                    have the REST / Catalog API proposal in the doc
                    to inform the above decision.

                    Thanks
                    Szehon

                    On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin
                    Moustafa <[email protected]> wrote:

                        I think it would help if we answer the
                        question of whether an MV is a view + storage
                        table (and degree of exposing this underlying
                        implementation) in the context of the user
                        interfacing with those concepts:

                        For the end user, interfacing with the engine
                        APIs (e.g., through SQL), materialized view
                        APIs should be almost the same as regular
                        view APIs (except for operations specific to
                        materialized views like REFRESH command etc).
                        Typically, the end user interacts with the
                        (materialized) view object as a view, and the
                        engine performs the abstraction over the
                        storage table.

                        For the engines interfacing with Iceberg, it
                        sounds like the correct abstraction at this layer
                        is indeed view + storage table, and engines
                        could have access to both objects to optimize
                        queries.

                        So in a sense, the engine will
                        ultimately hide most of the storage detail
                        from the end user (except for advanced users
                        who want to explicitly access the storage
                        table with a modifier like
                        "db.view.storageTable" -- and they can only
                        read it), while Iceberg will expose the
                        storage details to the engine catalog to use
                        it in scans if needed. So the storage table
                        is hidden or exposed based on the context/the
                        actual users. From Iceberg's point of view
                        (which interacts with the engines), the
                        storage table is exposed. Note that this does
                        not necessarily mean that the storage table
                        is registered in the catalog with its own
                        independent name (e.g., where we can drop the
                        view but keep the storage table and access it
                        from the catalog). Addressing the storage
                        table using a virtual namespace like
                        "db.view.storageTable" sounds like a good
                        middle ground. Anyways, end users should not
                        need to directly access the storage table in
                        most cases.
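
To make the virtual-namespace idea in Walaa's message concrete, here is a minimal Python sketch of how an engine might resolve a name such as "db.view.storageTable" and keep the storage table read-only. It is only an illustration under stated assumptions: the names `resolve`, `check_access` and `ResolvedName`, and the restriction to a single-level namespace, are hypothetical and are not part of any Iceberg API or of the proposal document.

```python
from typing import NamedTuple

# Modifier taken from the "db.view.storageTable" example in the thread.
STORAGE_SUFFIX = "storageTable"


class ResolvedName(NamedTuple):
    namespace: str
    view: str
    is_storage_table: bool


def resolve(identifier: str) -> ResolvedName:
    """Split a dotted name; assumes a single-level namespace (hypothetical)."""
    parts = identifier.split(".")
    if len(parts) == 3 and parts[2] == STORAGE_SUFFIX:
        # The storage table is addressed through its view, not by its own name.
        return ResolvedName(parts[0], parts[1], True)
    if len(parts) == 2:
        return ResolvedName(parts[0], parts[1], False)
    raise ValueError(f"unsupported identifier: {identifier!r}")


def check_access(name: ResolvedName, operation: str) -> None:
    """Reject anything but reads when the storage table is reached this way."""
    if name.is_storage_table and operation != "read":
        raise PermissionError("storage table is read-only via the view modifier")
```

Because the storage table has no independent catalog name in this sketch, dropping the view leaves nothing to address, which matches the point that the table need not be registered on its own.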

                        Thanks,
                        Walaa.

                        On Mon, Feb 19, 2024 at 3:38 PM Micah
                        Kornfield <[email protected]> wrote:

                            Hi Jack,

                                In my mind, the first key point we
                                all need to agree upon to move this
                                design forward is: *Do we really want
                                to go with the MV = view + storage
                                table design approach for Iceberg MV?*


                            I think we want this to the extent that
                            we do not want to redefine the same
                            concept with different
                            representations/naming to the greatest
                            degree possible. This is why borrowing
                            the concepts from the view (e.g. multiple
                            ways of expressing the same view logic in
                            different dialects) and aspects of the
                            materialized data (e.g. partitioning,
                            ordering) feels most natural. IIUC your
                            proposal, I think you are saying maybe
                            two modifications to the existing
                            proposals in the document:

                            1.  No separate storage table link,
                            instead embed most of the metadata of the
                            materialized table into the MV document
                            (the exception seems to be snapshot history)
                            2.  For snapshot history, have one
                            unified history specific to the MV.

                            This seems fairly reasonable to me and I
                            think it can solve some challenges with
                            the existing proposal in an elegant way.
                            If this is correct (or maybe if it isn't
                            quite correct) perhaps you can make
                            suggestions to the document so all of the
                            trade-offs can be discussed in one place?

                            I think the one thing the current draft
                            of the materialized view ignores is how
                            to store algebraic summaries (e.g.
                            separate sum and count for averages, or
                            other sketches), so that new data can be
                            incrementally incorporated.  But
                            representing these structures feels like
                            it potentially has value beyond just MVs
                            (e.g. it can be a natural way to
                            express summary statistics in table
                            metadata), so I think it deserves at
                            least a try in incorporating the concepts
                            in the table specification, so the
                            definitions can be shared.  I was
                            imagining this could come as part of the
                            next revision of MV specification.
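
The algebraic-summary point in Micah's message (keeping a separate sum and count for averages so that new data can be incorporated incrementally) can be illustrated with a short Python sketch. The `AvgSummary` class and its method names are hypothetical and chosen only for this example; nothing here reflects an existing or proposed Iceberg structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgSummary:
    """Mergeable summary for an average: store sum and count, not the mean."""
    total: float = 0.0
    count: int = 0

    def add_rows(self, values):
        # Fold a batch of newly arrived values into the summary.
        values = list(values)
        return AvgSummary(self.total + sum(values), self.count + len(values))

    def merge(self, other):
        # Two summaries combine exactly; two stored means alone could not.
        return AvgSummary(self.total + other.total, self.count + other.count)

    def value(self):
        # The average is derived on read; None when no rows have been seen.
        return self.total / self.count if self.count else None


stored = AvgSummary().add_rows([10, 20, 30])  # state after the last refresh
delta = AvgSummary().add_rows([40])           # rows appended since then
refreshed = stored.merge(delta)               # incremental refresh, no rescan
```

The design choice is that the stored state is closed under `merge`, so a refresh only needs the delta rather than a full recomputation over the source data.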

                                The MV internal structure could
                                evolve in a way that works more
                                efficiently with the reduced scope of
                                functionalities, without relying on
                                table to offer the same capabilities.
                                I can at least say that is true based
                                on my internal knowledge of how
                                Redshift MVs work.


                            I'm not sure I fully understand this
                            point, but it seems the main question
                            here is what would break if it started to
                            evolve in this direction.  Is it purely
                            additive or do we suspect some elements
                            would need to be removed?  My gut feeling
                            here is the main concerns here are
                            getting the cardinatities correct (i.e. 1
                            MV should probably have 0, 1 or more
                            materialized storage tables associated
                            with it, to support more advanced
                            algebraic structures listed above, and
                            perhaps a second without them, and
                            additional metadata to distinguish
                            between these two different modes).

                                If after the evaluation, we are
                                confident that the MV = view +
                                storage table approach is the right
                                way to go, then we can debate the
                                other issues, and I think the next
                                issue to reach consensus should be
                                "Should the storage table be
                                registered in the catalog?".


                            I think there are actually more
                            fundamental questions posed:
                            1.  Should we be considering how items
                            should be modelled in the REST API
                            concurrently with the Iceberg spec, as
                            that potentially impacts design decisions
                            (I think the answer is yes, and we should
                            update the doc with sketches on new
                            endpoints and operations on the endpoints
                            to ensure things align).
                            2.  Going forward should new aspects of
                            Iceberg artifacts rely on the fact that a
                            catalog is present and we can rely on a
                            naming convention for looking up other
                            artifacts in a catalog as pointers (I
                            lean yes on this, but I'm a little bit
                            more ambivalent).

                            Thanks,
                            Micah

                            On Mon, Feb 19, 2024 at 12:52 PM Jack Ye
                            <[email protected]> wrote:

                                I suggest we need a step-by-step
                                process to make incremental
                                consensus, otherwise we are
                                constantly talking about many
                                different debates at the same time.

                                In my mind, the first key point we
                                all need to agree upon to move this
                                design forward is: *Do we really want
                                to go with the MV = view + storage
                                table design approach for Iceberg MV?*

                                I think we (at least me) started with
                                this assumption, mostly because this
                                is how Trino implements MV, and how
                                Hive tables store MV information
                                today. But does it mean we should
                                design it that way in Iceberg?

                                Looking back at how we did the
                                view spec design, we could also have said
                                that we just add a representation
                                field in the table spec to store
                                view, and an Iceberg view is just a
                                table with no data but with
                                representations defined. But we did
                                not do that. So it feels now quite
                                inconsistent to say we want to just
                                add a few fields in the table and
                                view spec to call it an Iceberg MV.

                                If we look into most of the other
                                database systems (e.g. Redshift,
                                BigQuery, Snowflake), they never
                                expose such implementation details
                                like storage table. Apart from being
                                closed-source systems, I think it is
                                also for good technical reasons.
                                There are many more things that a
                                table needs to support, but does not
                                really apply to MV. The MV internal
                                structure could evolve in a way that
                                works more efficiently with the
                                reduced scope of functionalities,
                                without relying on table to offer the
                                same capabilities. I can at least say
                                that is true based on my internal
                                knowledge of how Redshift MVs work.

                                I think we should fully evaluate both
                                directions, and commit to one first
                                before debating more things.

                                If we have a new and independent
                                Iceberg MV spec, then an Iceberg MV
                                is under-the-hood a single object
                                containing all MV information. It has
                                its own name, snapshots, view
                                representation, etc. I don't believe
                                we will be blocked by Trino due to
                                its MV SPIs currently requiring the
                                existence of a storage table, as it
                                will just be a different
                                implementation from the existing one
                                in Trino-Iceberg. In this direction,
                                I don't think we need to have any
                                further debate about pointers,
                                metadata locations, storage table,
                                etc. because everything will be new.

                                If after the evaluation, we are
                                confident that the MV = view +
                                storage table approach is the right
                                way to go, then we can debate the
                                other issues, and I think the next
                                issue to reach consensus should be
                                "Should the storage table be
                                registered in the catalog?".

                                What do we think?

                                -Jack




                                On Mon, Feb 19, 2024 at 11:32 AM
                                Daniel Weeks <[email protected]> wrote:

                                    Jack,

                                    I think we should consider either
                                    allowing the storage table to be
                                    fully exposed/addressable via the
                                    catalog or allow access via
                                    namespacing like with metadata
                                    tables.  E.g.
                                    <catalog>.<database>.<table>.<storage>,
                                    which would allow for full access
                                    to the underlying table.

                                    For other engines to interact
                                    with the storage table (e.g. to
                                    execute the query to materialize
                                    the table), it may be necessary
                                    that the table is fully
                                    addressable.  Whether the storage
                                    table is returned as part of list
                                    operations may be something we
                                    leave up to the catalog
                                    implementation.

                                    I don't think the table should
                                    reference a physical location
                                    (only a logical reference) since
                                    things will be changing behind
                                    the view definition and I'm not
                                    confident we want to have to
                                    update the view representation
                                    every time the storage table is
                                    updated.

                                    I think there's still some
                                    exploration as to whether we need
                                    to model this as separate from
                                    view endpoints, but there may be
                                    enough overlap that it's not
                                    necessary to have yet another set
                                    of endpoints for materialized
                                    views (maybe filter params if you
                                    need to distinguish?).

                                    -Dan



                                    On Sun, Feb 18, 2024 at 6:57 PM
                                    Renjie Liu
                                    <[email protected]> wrote:

                                        Hi, Jack:

                                        Thanks for raising this.

                                            In most database systems,
                                            MV, view and table are
                                            considered independent
                                            objects, at least at API
                                            level. It is very rare
                                            for a system to support
                                            operations like
                                            "materializing a logical
                                            view" or "upgrading a
                                            logical view to MV",
                                            because view and MV are
                                            very different in almost
                                            every aspect of user
                                            experience. Extending the
                                            existing view or table
                                            spec to accommodate MV
                                            might give us a MV
                                            implementation similar to
                                            the current Trino or Hive
                                            views, save us some
                                            effort and a few APIs in
                                            REST, but it binds us to
                                            a very specific design of
                                            MV, which we might regret
                                            in the future.


                                        When I reviewed the doc, I
                                        thought we were discussing
                                        the spec of materialized
                                        view, just like the spec of
                                        table metadata, but not
                                        the user-facing API. I
                                        would definitely agree that
                                        we should consider MV as
                                        another kind of database
                                        object in user facing api,
                                        even though it's internally
                                        modelled as a view + storage
                                        table pointer.

                                            If we want to make the
                                            REST experience good for
                                            MV, I think we should at
                                            least consider directly
                                            describing the full
                                            metadata of the storage
                                            table in Iceberg view,
                                            instead of pointing to a
                                            JSON file.


                                        Do you mean we need to add
                                        components like
                                        `LoadMaterializedViewResponse`,
                                        if so, I would +1 for this.

                                            *Q2: what REST APIs do we
                                            expect to use for
                                            interactions with MVs?*


                                        As I have mentioned above, I
                                        think we should consider MV
                                        as another database object,
                                        so I think we should add a
                                        set of apis specifically
                                        designed for MV, such as
                                        `loadMV`, `freshMV`.

                                        On Sat, Feb 17, 2024 at
                                        11:14 AM Jack Ye
                                        <[email protected]> wrote:

                                            Hi everyone,

                                            As we are discussing the
                                            spec change for
                                            materialized view, there
                                            has been a missing aspect
                                            that is technically also
                                            related, and might affect
                                            the MV spec design: *how
                                            do we want to add MV
                                            support to the REST spec?*

                                            I would like to discuss
                                            this in a new thread to
                                            collect people's
                                            thoughts. This topic
                                            expands to the following
                                            2 sub-questions:

                                            *Q1: how would the MV
                                            spec change affect the
                                            REST spec?*
                                            In the current proposal,
                                            it looks like we are
                                            using a design where a MV
                                            is modeled as an Iceberg
                                            view linking to an
                                            Iceberg storage table. At
                                            the same time, we do not
                                            want to expose this
                                            storage table in the
                                            catalog, thus the Iceberg
                                            view has a pointer to
                                            only a metadata JSON file
                                            of the Iceberg storage
                                            table. Each MV refresh
                                            updates the pointer to a
                                            new metadata JSON file.

                                            I feel this does not play
                                            very well with the
                                            direction that REST is
                                            going. The REST catalog
                                            is trying to remove the
                                            dependency on the
                                            metadata JSON file. For
                                            example, in
                                            LoadTableResponse the
                                            only required field is
                                            the metadata, and
                                            metadata-location is
                                            actually optional.
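
                                            A minimal Python sketch
                                            of that shape. The class
                                            name, the from_json
                                            helper and the sample
                                            values are hypothetical;
                                            only the required versus
                                            optional split comes from
                                            the point above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LoadTableResult:
    """Hypothetical model of a load-table response body."""
    metadata: Dict[str, Any]                 # required: the full table metadata
    metadata_location: Optional[str] = None  # optional: path to a metadata JSON file

    @classmethod
    def from_json(cls, body: Dict[str, Any]) -> "LoadTableResult":
        # Reject a response that carries no inline metadata.
        if "metadata" not in body:
            raise ValueError("response must carry the table metadata")
        return cls(metadata=body["metadata"],
                   metadata_location=body.get("metadata-location"))


# A catalog that keeps no metadata JSON file can still answer.
inline_only = LoadTableResult.from_json({"metadata": {"format-version": 2}})

# A file-backed catalog may additionally report the location (made-up path).
with_file = LoadTableResult.from_json({
    "metadata": {"format-version": 2},
    "metadata-location": "s3://bucket/tbl/metadata/v3.metadata.json",
})
```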

                                            If we want to make the
                                            REST experience good for
                                            MV, I think we should at
                                            least consider directly
                                            describing the full
                                            metadata of the storage
                                            table in the Iceberg view,
                                            instead of pointing to a
                                            JSON file.

                                            *Q2: what REST APIs do we
                                            expect to use for
                                            interactions with MVs?*
                                            So far we have been
                                            thinking about
                                            amending the view spec to
                                            accommodate MV. This
                                            likely entails MVs also
                                            being handled through
                                            the view APIs in the
                                            REST spec.

                                            We need to agree on
                                            that first in the
                                            community, because this
                                            has various implications,
                                            and I am not really sure
                                            at this point if it is
                                            the best way to go.

                                            If MV interactions are
                                            through the view APIs,
                                            the view APIs need to be
                                            updated to accommodate MV
                                            constructs that are not
                                            really related to logical
                                            views. In fact, most
                                            actions performed on MVs
                                            are more similar to
                                            actions performed on a
                                            table than on a view,
                                            since they involve
                                            configuring data layout,
                                            read and write
                                            constructs. For
                                            example, users might run
                                            something like:

                                            CREATE MATERIALIZED VIEW mv
                                            PARTITION BY col1
                                            CLUSTER BY col2
                                            AS ( // some sql )

                                            Then the CreateView API
                                            needs to accept a
                                            partition spec and sort
                                            order that are completely
                                            irrelevant to logical
                                            views.
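
                                            As a rough sketch of that
                                            concern, assume a single
                                            request shape serving
                                            both cases. Every class
                                            and field name below is
                                            hypothetical and not
                                            taken from the REST spec.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CreateViewRequest:
    """Hypothetical request shared by logical views and MVs."""
    name: str
    sql: str
    materialized: bool = False
    # Table-style constructs, only meaningful when materialized is True.
    partition_by: List[str] = field(default_factory=list)
    cluster_by: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # A logical view stores no data, so layout settings make no sense for it.
        if not self.materialized and (self.partition_by or self.cluster_by):
            raise ValueError("partitioning and clustering do not apply to logical views")


# Placeholder query; mirrors the CREATE MATERIALIZED VIEW statement above.
mv = CreateViewRequest("mv", "SELECT col1, col2 FROM t", materialized=True,
                       partition_by=["col1"], cluster_by=["col2"])
mv.validate()
```

                                            The validate step is the
                                            kind of special casing
                                            the view API would have
                                            to carry for fields that
                                            only MVs use.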

                                            When reading a MV, we
                                            might even want to have a
                                            PlanMaterializedView API
                                            similar to the PlanTable
                                            API we are adding.

                                            *My personal take*
                                            It feels like we need to
                                            reconsider the question
                                            of what is the best way
                                            to model MV in Iceberg.
                                            Should it be (1) a view
                                            linked to a storage
                                            table, or (2) a table
                                            with a view SQL
                                            associated with it, or
                                            (3) a completely
                                            independent object? This
                                            topic was discussed in
                                            the past in this doc
                                            <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
                                            but at that time we did
                                            not have much perspective
                                            about aspects like REST
                                            spec, and the view
                                            integration was also not
                                            fully completed yet. With
                                            the new knowledge,
                                            currently I am actually
                                            leaning a bit more
                                            towards (3).

                                            In most database systems,
                                            MV, view and table are
                                            considered independent
                                            objects, at least at API
                                            level. It is very rare
                                            for a system to support
                                            operations like
                                            "materializing a logical
                                            view" or "upgrading a
                                            logical view to MV",
                                            because view and MV are
                                            very different in almost
                                            every aspect of user
                                            experience. Extending the
                                            existing view or table
                                            spec to accommodate MV
                                            might give us a MV
                                            implementation similar to
                                            the current Trino or Hive
                                            views, save us some
                                            effort and a few APIs in
                                            REST, but it binds us to
                                            a very specific design of
                                            MV, which we might regret
                                            in the future.

                                            If we make a new MV spec,
                                            it can be made up of
                                            fields that already exist
                                            in the table and view
                                            specs, but it is a whole
                                            new spec. In this way,
                                            the spec can evolve
                                            independently to
                                            accommodate MV-specific
                                            features, and we can also
                                            create MV-related REST
                                            endpoints that will
                                            evolve independently from
                                            table and view REST APIs.
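
                                            One way to picture this,
                                            as a sketch only: the
                                            class name and all field
                                            names below are
                                            hypothetical picks
                                            loosely modeled on table
                                            and view metadata, not a
                                            proposed spec.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MaterializedViewMetadata:
    """Hypothetical standalone MV metadata reusing familiar field shapes."""
    # View-like fields
    sql: str
    dialect: str
    # Table-like fields
    schema: Dict[str, Any]
    partition_spec: List[str] = field(default_factory=list)
    sort_order: List[str] = field(default_factory=list)
    # MV-specific field (invented here) that neither existing spec needs
    refresh_state: Dict[str, Any] = field(default_factory=dict)


mv = MaterializedViewMetadata(
    sql="SELECT col1, col2 FROM t",  # placeholder query
    dialect="spark",
    schema={"fields": ["col1", "col2"]},
    partition_spec=["col1"],
    sort_order=["col2"],
)
```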

                                            On the other hand, it
                                            definitely means more
                                            work to maintain a new
                                            spec, and potentially a
                                            big refactoring of the
                                            codebase so that
                                            operations that work on
                                            tables or views today
                                            also support MV as a
                                            different object. And it
                                            definitely has other
                                            problems that I have
                                            overlooked. I would
                                            greatly appreciate any
                                            thoughts about this!

                                            Best,
                                            Jack Ye
