Re: Materialized view integration with REST spec

Jan Kaul Wed, 21 Feb 2024 00:40:21 -0800

Thank you Jack for driving the consensus for the MV spec and thank youall for the discussion.

I really like the idea about incremental consensus because we oftenloose sight in detailed discussions. As Jack mentioned, the highestpriority question currently is: *Should the Iceberg MV be realized as aview + storage table or do we define a new metadata format?

*To have one place for the discussion, I created another Question(https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)to the Materialized View Spec google document.

To improve the visibility of the arguments I would like to propose a newprocess. It would be great if all relevant information is stored in thedocument itself. Therefore I would suggest to use the comment threadsfor smaller, temporary discussions which can be resolved by adding thepoints to the main document. Please close the threads if the informationwas added to the document. Additionally, I gave you all permissions toedit the documents, so you can add missing points yourselves.

Of course we also need threads that express our preferences (voting). Iwould suggest to keep these separate from discussions about singlepoints so that they can be persisted in the document.

After a phase of collecting arguments for the different designs I thinkit would make sense to have video call to have a face to face discussion.


What do you think?

Best wishes,

Jan

On 20.02.24 21:32, Manish Malhotra wrote:

Very excited for MV to be in Iceberg :)
Keeping in the same doc. would be helpful, to have the trail.

But also agreed, if there are too many directions/threads, then keepclosing the old one, if there are no more questions.

And put down the assumptions for the initial version to move forward.

On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa<wa.moust...@gmail.com> wrote:


    I would vote to keep a log in the doc with open questions, and
    keep the doc updated with open questions as they arise/get resolved.

    On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <yezhao...@gmail.com> wrote:

        Thanks for the response from everyone!

        Before proceeding further, I see a few people referring back
        to the current design from Jan. I specifically raised this
        thread based on the information in the doc and a few latest
        discussions we had there. Because there are many threads in
        the doc, and each thread points further to other discussion
        threads in the same doc or other doc, it is now quite hard to
        follow and continue discussing all different topics there.

        I hope we can make incremental consensus of the questions in
        the doc through devlist, because it provides more visibility,
        and also a single thread instead of multiple threads going on
        at the same time. If we think this format is not effective, I
        propose that we create a new mv channel in Iceberg Slack
        workspace, and people interested can join and discuss all
        these points directly. What do we think?

        Best,
        Jack Ye



        On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho
        <szehon.apa...@gmail.com> wrote:

            Hi,

            Great to see more discussion on the MV spec. Actually,
            Jan's document "Iceberg Materialized View Spec"
            
<https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
 has
            been organized , with a "Design Questions" section to
            track these debates, and it would be nice to centralize
            the debates there, as Micah mentions.

            For Dan's question, I think this debate was tracked in
            "DesignQuestion 3: Should the storage table be registered
            in the catalog?". I think the general idea there was to
            not expose it directly via Catalog as it is then exposed
            to user modification. If the engine wants to access
            anything about the storage table (including audit and
            storage), it is of course there via the storage table
            pointer. I think Walaa's point is also good, we could
            expose it as we expose metadata tables, but I am still not
            sure if there is still some use-cases of engine access not
            covered?

            It is true that for Jack's initial question (Do we really
            want to go with the MV = view + storage table design
            approach for Iceberg MV?), unfortunately we did not
            capture it as a "Design Question" in Jan's doc, as it was
            an implicit assumption of 'yes', because it is the choice
            of Hive, Trino, and other engines , as others have pointed
            out.

            Jack's point about potential evolution of MV (like to add
            partitioning) is an interesting one, but definitely hard
            to grasp.  I think it makes sense to add this as a
            separate Design Question in the doc, and add the options. 
            This will allow us to flesh out this alternative
            option(s).  Maybe Micah's point about modifying existing
            proposal to 'embed' the required table metadata fields in
            the existing view metadata, is one middle ground option. 
            Or we add a totally new MV object spec for MV, separate
            than existing View spec?

            Also , as Jack pointed out, it may make sense to have the
            REST / Catalog API proposal in the doc to educate the
            above decision.

            Thanks
            Szehon

            On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa
            <wa.moust...@gmail.com> wrote:

                I think it would help if we answer the question of
                whether an MV is a view + storage table (and degree of
                exposing this underlying implementation) in the
                context of the user interfacing with those concepts:

                For the end user, interfacing with the engine APIs
                (e.g., through SQL), materialized view APIs should be
                almost the same as regular view APIs (except for
                operations specific to materialized views like REFRESH
                command etc). Typically, the end user interacts with
                the (materialized) view object as a view, and the
                engine performs the abstraction over the storage table.

                For the engines interfacing with Iceberg, it sounds
                the correct abstraction at this layer is indeed view +
                storage table, and engines could have access to both
                objects to optimize queries.

                So in a sense, the engine will ultimately hide most of
                the storage detail from the end user (except for
                advanced users who want to explicitly access the
                storage table with a modifier like
                "db.view.storageTable" -- and they can only read it),
                while Iceberg will expose the storage details to the
                engine catalog to use it in scans if needed. So the
                storage table is hidden or exposed based on the
                context/the actual users. From Iceberg point of view
                (which interacts with the engines), the storage table
                is exposed. Note that this does not necessarily mean
                that the storage table is registered in the catalog
                with its own independent name (e.g., where we can drop
                the view but keep the storage table and access it from
                the catalog). Addressing the storage table using a
                virtual namespace like "db.view.storageTable" sounds
                like a good middle ground. Anyways, end users should
                not need to directly access the storage table in most
                cases.

                Thanks,
                Walaa.

                On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield
                <emkornfi...@gmail.com> wrote:

                    Hi Jack,

                        In my mind, the first key point we all need to
                        agree upon to move this design forward is*: Do
                        we really want to go with the MV = view +
                        storage table design approach for Iceberg MV?*


                    I think we want this to the extent that we do not
                    want to redefine the same concept with different
                    representations/naming to the greatest degree
                    possible.  This is why borrowing the concepts from
                    the view (e.g. multiple ways of expressing the
                    same view logic in different dialects) and aspects
                    of the materialized data (e.g. partitioning,
                    ordering) feels most natural.  IIUC your proposal,
                    I think you are saying maybe two modifications to
                    the existing proposals in the document:

                    1.  No separate storage table link, instead embed
                    most of the metadata of the materialized table
                    into the MV document (the exception seems to be
                    snapshot history)
                    2.  For snapshot history, have one unified history
                    specific to the MV.

                    This seems fairly reasonable to me and I think I
                    can solve some challenges with the existing
                    proposal in an elegant way. If this is correct (or
                    maybe if it isn't quite correct) perhaps you can
                    make suggestions to the document so all of the
                    trade-offs can be discussed in one place?

                    I think the one thing the current draft of the
                    materialized view ignores is how to store
                    algebraic summaries (e.g. separate sum and count
                    for averages, or other sketches), so that new data
                    can be incrementally incorporated.  But
                    representing these structures feels like it
                    potentially has value beyond just MVs (e.g. it can
                    be a natural way to express summary statistics in
                    table metadata), so I think it deserves at least a
                    try in incorporating the concepts in the table
                    specification, so the definitions can be shared. 
                    I was imagining this could come as part of the
                    next revision of MV specification.

                        The MV internal structure could evolve in a
                        way that works more efficiently with the
                        reduced scope of functionalities, without
                        relying on table to offer the same
                        capabilities. I can at least say that is true
                        based on my internal knowledge of how Redshift

MVs work.


                    I'm not sure I fully understand this point, but it
                    seems the main question here is what would break
                    if it started to evolve in this direction.  Is it
                    purely additive or do we suspect some elements
                    would need to be removed?  My gut feeling here is
                    the main concerns here are getting the
                    cardinatities correct (i.e. 1 MV should probably
                    have 0, 1 or more materialized storage tables
                    associated with it, to support more advanced
                    algebraic structures listed above, and perhaps a
                    second without them, and additional metadata to
                    distinguish between these two different modes).

                        If after the evaluation, we are confident that
                        the MV = view + storage table approach is the
                        right way to go, then we can debate the other
                        issues, and I think the next issue to reach
                        consensus should be "Should the storage table
                        be registered in the catalog?".


                    I actually think there are actually more
                    fundamental questions posed:
                    1.  Should be considering how items should be
                    modelled in the REST API concurrently with the
                    Iceberg spec, as that potentially impacts design
                    decision (I think the answer is yes, and we should
                    update the doc with sketches on new endpoints and
                    operations on the endpoints to ensure things align).
                    2.  Going forward should new aspects of Iceberg
                    artifacts rely on the fact that a catalog is
                    present and we can rely on a naming convention for
                    looking up other artifacts in a catalog as
                    pointers (I lean yes on this, but I'm a little bit
                    more ambivalent).

                    Thanks,
                    Micah

                    On Mon, Feb 19, 2024 at 12:52 PM Jack Ye
                    <yezhao...@gmail.com> wrote:

                        I suggest we need a step-by-step process to
                        make incremental consensus, otherwise we are
                        constantly talking about many different
                        debates at the same time.

                        In my mind, the first key point we all need to
                        agree upon to move this design forward is*: Do
                        we really want to go with the MV = view +
                        storage table design approach for Iceberg MV?*

                        I think we (at least me) started with this
                        assumption, mostly because this is how Trino
                        implements MV, and how Hive tables store MV
                        information today. But does it mean we should
                        design it that way in Iceberg?

                        Now I look back at how we did the view spec
                        design, we could also say that we just add a
                        representation field in the table spec to
                        store view, and an Iceberg view is just a
                        table with no data but with representations
                        defined. But we did not do that. So it feels
                        now quite inconsistent to say we want to just
                        add a few fields in the table and view spec to
                        call it an Iceberg MV.

                        If we look into most of the other database
                        systems (e.g. Redshift, BigQuery, Snowflake),
                        they never expose such implementation details
                        like storage table. Apart from being
                        close-sourced systems, I think it is also for
                        good technical reasons. There are many more
                        things that a table needs to support, but does
                        not really apply to MV. The MV internal
                        structure could evolve in a way that works
                        more efficiently with the reduced scope of
                        functionalities, without relying on table to
                        offer the same capabilities. I can at least
                        say that is true based on my internal
                        knowledge of how Redshift MVs work.

                        I think we should fully evaluate both
                        directions, and commit to one first before
                        debating more things.

                        If we have a new and independent Iceberg MV
                        spec, then an Iceberg MV is under-the-hood a
                        single object containing all MV information.
                        It has its own name, snapshots, view
                        representation, etc. I don't believe we will
                        be blocked by Trino due to its MV SPIs
                        currently requiring the existence of a storage
                        table, as it will just be a different
                        implementation from the existing one in
                        Trino-Iceberg. In this direction, I don't
                        think we need to have any further debate about
                        pointers, metadata locations, storage table,
                        etc. because everything will be new.

                        If after the evaluation, we are confident that
                        the MV = view + storage table approach is the
                        right way to go, then we can debate the other
                        issues, and I think the next issue to reach
                        consensus should be "Should the storage table
                        be registered in the catalog?".

                        What do we think?

                        -Jack




                        On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks
                        <dwe...@apache.org> wrote:

                            Jack,

                            I think we should consider either allowing
                            the storage table to be fully
                            exposed/addressable via the catalog or
                            allow access via namespacing like with
                            metadata tables.  E.g.
                            <catalog>.<database>.<table>.<storage>,
                            which would allow for full access to the
                            underlying table.

                            For other engines to interact with the
                            storage table (e.g. to execute the query
                            to materialize the table), it may be
                            necessary that the table is fully
                            addressable.  Whether the storage table is
                            returned as part of list operations may be
                            something we leave up to the catalog
                            implementation.

                            I don't think the table should reference a
                            physical location (only a logical
                            reference) since things will be
                            changing behind the view definition and
                            I'm not confident we want to have to
                            update the view representation everytime
                            the storage table is updated.

                            I think there's still some exploration as
                            to whether we need to model this as
                            separate from view endpoints, but there
                            may be enough overlap that it's not
                            necessary to have yet another set of
                            endpoints for materialized views (maybe
                            filter params if you need to distinguish?).

                            -Dan



                            On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu
                            <liurenjie2...@gmail.com> wrote:

                                Hi, Jack:

                                Thanks for raising this.

                                    In most database systems, MV, view
                                    and table are considered
                                    independent objects, at least at
                                    API level. It is very rare for a
                                    system to support operations like
                                    "materializing a logical view" or
                                    "upgrading a logical view to MV",
                                    because view and MV are very
                                    different in almost every aspect
                                    of user experience. Extending the
                                    existing view or table spec to
                                    accommodate MV might give us a MV
                                    implementation similar to the
                                    current Trino or Hive views, save
                                    us some effort and a few APIs in
                                    REST, but it binds us to a very
                                    specific design of MV, which we
                                    might regret in the future.


                                When I reviewed the doc, I thought we
                                were discussing the spec of
                                materialized view, just like the spec
                                of table metadata, but didn't not the
                                user facing api. I would definitely
                                agree that we should consider MV as
                                another kind of database object in
                                user facing api, even though it's
                                internally modelled as a view +
                                storage table pointer.

                                    If we want to make the REST
                                    experience good for MV, I think we
                                    should at least consider directly
                                    describing the full metadata of
                                    the storage table in Iceberg view,
                                    instead of pointing to a JSON file.


                                Do you mean we need to add components
                                like `LoadMaterializedViewResponse`,
                                if so, I would +1 for this.

                                    *Q2: what REST APIs do we expect
                                    to use for interactions with MVs?*


                                As I have mentioned above, I think we
                                should consider MV as another database
                                object, so I think we should add a set
                                of apis specifically designed for MV,
                                such as `loadMV`, `freshMV`.

                                On Sat, Feb 17, 2024 at 11:14 AM Jack
                                Ye <yezhao...@gmail.com> wrote:

                                    Hi everyone,

                                    As we are discussing the spec
                                    change for materialized view,
                                    there has been a missing aspect
                                    that is technically also related,
                                    and might affect the MV spec
                                    design: *how do we want to add MV
                                    support to the REST spec?*
                                    *
                                    *
                                    I would like to discuss this in a
                                    new thread to collect people's
                                    thoughts. This topic expands to
                                    the following 2 sub-questions:

                                    *Q1: how would the MV spec change
                                    affect the REST spec?*
                                    In the current proposal, it looks
                                    like we are using a design where a
                                    MV is modeled as an Iceberg view
                                    linking to an Iceberg storage
                                    table. At the same time, we do not
                                    want to expose this storage table
                                    in the catalog, thus the Iceberg
                                    view has a pointer to only a
                                    metadata JSON file of the Iceberg
                                    storage table. Each MV refresh
                                    updates the pointer to a new
                                    metadata JSON file.

                                    I feel this does not play very
                                    well with the direction that REST
                                    is going. The REST catalog is
                                    trying to remove the dependency to
                                    the metadata JSON file. For
                                    example, in LoadTableResponse the
                                    only required field is the
                                    metadata, and metadata-location is
                                    actually optional.

                                    If we want to make the REST
                                    experience good for MV, I think we
                                    should at least consider directly
                                    describing the full metadata of
                                    the storage table in Iceberg view,
                                    instead of pointing to a JSON file.

                                    *Q2: what REST APIs do we expect
                                    to use for interactions with MVs?*
                                    So far we have been thinking about
                                    amending the view spec to
                                    accommodate MV. This entails
                                    likely having MVs also being
                                    handled through the view APIs in
                                    REST spec.

                                    We need to agree with that first
                                    in the community, because this has
                                    various implications, and I am not
                                    really sure at this point if it is
                                    the best way to go.

                                    If MV interactions are through the
                                    view APIs, the view APIs need to
                                    be updated to accommodate MV
                                    constructs that are not really
                                    related to logical views. In fact,
                                    most actions performed on MVs are
                                    more similar to actions performed
                                    on table rather than view, which
                                    involve configuring data layout,
                                    read and write constructs. For
                                    example, users might run something
                                    like:

                                    CREATE MATERIALIZED VIEW mv
                                    PARTITION BY col1
                                    CLUSTER BY col2
                                    AS ( // some sql )

                                    then the CreateView API needs to
                                    accept partition spec and sort
                                    order that are completely not
                                    relevant for logical views.

                                    When reading a MV, we might even
                                    want to have a
                                    PlanMaterializedView API similar
                                    to the PlanTable API we are adding.

                                    *My personal take*
                                    It feels like we need to
                                    reconsider the question of what is
                                    the best way to model MV in
                                    Iceberg. Should it be (1) a view
                                    linked to a storage table, or (2)
                                    a table with a view SQL associated
                                    with it, or (3) it's a completely
                                    independent thing. This topic was
                                    discussed in the past in this doc
                                    
<https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
                                    but at that time we did not have
                                    much perspective about aspects
                                    like REST spec, and the view
                                    integration was also not fully
                                    completed yet. With the new
                                    knowledge, currently I am actually
                                    leaning a bit more towards (3).

                                    In most database systems, MV, view
                                    and table are considered
                                    independent objects, at least at
                                    API level. It is very rare for a
                                    system to support operations like
                                    "materializing a logical view" or
                                    "upgrading a logical view to MV",
                                    because view and MV are very
                                    different in almost every aspect
                                    of user experience. Extending the
                                    existing view or table spec to
                                    accommodate MV might give us a MV
                                    implementation similar to the
                                    current Trino or Hive views, save
                                    us some effort and a few APIs in
                                    REST, but it binds us to a very
                                    specific design of MV, which we
                                    might regret in the future.

                                    If we make a new MV spec, it can
                                    be made up of fields that already
                                    exist in the table and view specs,
                                    but it is a whole new spec. In
                                    this way, the spec can evolve
                                    independently to accommodate MV
                                    specific features, and we can also
                                    create MV-related REST endpoints
                                    that will evolve independently
                                    from table and view REST APIs.

                                    But on the other side it is
                                    definitely associated with more
                                    work to maintain a new spec, and
                                    potentially big refactoring in the
                                    codebase to make sure operations
                                    today that work on table or view
                                    can now support MV as a different
                                    object. And it definitely has
                                    other problems that I have
                                    overlooked. I would greatly
                                    appreciate any thoughts about this!

                                    Best,
                                    Jack Ye

Re: Materialized view integration with REST spec

Reply via email to