Looks like it wasn’t clear what I meant for the 3 categories, so I’ll
be more specific:
* /Separate table and view/: this option is to have the objects that
we have today, with extra metadata. Commit processes are separate:
committing to the table doesn’t alter the view and committing to
the view doesn’t change the table. However, changing the view can
make it so the table is no longer useful as a materialization.
* /A combination of a view and a table/: in this option, the table
metadata and view metadata are the same as in the first option. The
difference is that the commit process combines them, either by
embedding a table metadata location in view metadata or by
tracking both in the same catalog reference.
* /A new metadata type/: this option is where we define a new
metadata object that has view attributes, like SQL
representations, along with table attributes, like partition specs
and snapshots.
Hopefully this is clear because I think much of the confusion is
caused by different definitions.
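To make the distinction concrete, here is a rough sketch of what a
catalog would track in each case (the names are illustrative only, not
spec):

// 1. Separate table and view: two independent catalog entries with
//    independent commit processes.
record ViewEntry(String viewMetadataLocation) {}
record TableEntry(String tableMetadataLocation) {}

// 2. A combination of a view and a table: the same two metadata files,
//    but a single catalog entry (or the view metadata embedding the
//    table metadata location), so one commit swaps both pointers.
record CombinedEntry(String viewMetadataLocation, String tableMetadataLocation) {}

// 3. A new metadata type: one metadata file that carries both view
//    attributes (SQL representations) and table attributes (partition
//    specs, snapshots).
record MaterializedViewEntry(String mvMetadataLocation) {}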
The LoadTableResponse having an optional metadata-location field
implies that the object in the catalog no longer needs to hold a
metadata file pointer.
The REST protocol has not removed the requirement for a metadata file,
so I’m going to keep focused on the MV design options.
When we say an MV can be a “new metadata type”, it does not mean it
needs to define a completely brand-new structure for the metadata
content.
I’m making a distinction between separate metadata files for the table
and the view and a combined metadata object, as above.
We can define an “Iceberg MV” to be an object in a catalog, which
has 1 table metadata file pointer, and 1 view metadata file pointer
This is the option I am referring to as a “combination of a view and a
table”.
So to review my initial email, I don’t see a reason why a combined
view and table is advantageous, either implemented by having a catalog
reference with two metadata locations or embedding a table metadata
location in view metadata. This would cause unnecessary dependence
between the view and table in catalogs. I guess there’s an argument
that you could load both table and view metadata locations at the same
time. That hardly seems worth the trouble given the recent issues with
adding views to the JDBC catalog.
I also think that once we decide on structure, we can make it possible
for REST catalog implementations to do smart things, in a way that
doesn’t put additional requirements on the underlying catalog store.
For instance, we could specify how to send additional objects in a
LoadViewResult, in case the catalog wants to pre-fetch table metadata.
I think these optimizations are a later addition, after we define the
relationship between views and tables.
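To be concrete, a pre-fetching response could eventually look something
like this sketch (purely hypothetical; none of the extra fields exist in
the REST spec today):

// existing LoadViewResult fields plus an optional, catalog-controlled
// pre-fetch of the storage table (hypothetical fields marked below)
record LoadViewResult(
    String metadataLocation,              // exists today
    ViewMetadata metadata,                // exists today
    Map<String, String> config,           // exists today
    String storageTableMetadataLocation,  // hypothetical pre-fetch field
    TableMetadata storageTableMetadata) { // hypothetical pre-fetch field
}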
Jack, it sounds like you’re the proponent of a combined table and view
(rather than a new metadata spec for a materialized view). What is the
main motivation? It seems like you’re convinced of that approach, but
I don’t understand the advantage it brings.
Ryan
On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
wrote:
Hi
Yes I mostly agree with the assessment. To clarify a few minor
points.
is a materialized view a view and a separate table, a
combination of the two (i.e. commits are combined), or a new
metadata type?
For 'new metadata type', I mostly have in mind Jack's initial proposal
of a new Catalog MV object that has two references (ViewMetadata +
TableMetadata).
The arguments that I see for a combined materialized view
object are:
* Regular views are separate, rather than being tables with
SQL and no data, so it would be inconsistent (“Iceberg view
is just a table with no data but with representations
defined. But we did not do that.”)
* Materialized views are different objects in DDL
* Tables may be a superset of functionality needed for
materialized views
* Tables are not typically exposed to end users — but this
isn’t required by the separate view and table option
For completeness, there seem to be a few additional ones
(mentioned in the Slack and above messages).
* Lack of spec change (to ViewMetadata). But as Jack says, it is
a spec change (i.e., to catalogs)
* A single call to get the View's StorageTable (versus two calls)
* A more natural API: no opportunity for the user to call
Catalog.dropTable() or renameTable() on the storage table
*Thoughts:* I think the long discussion sessions we had on Slack
were fruitful for me, as seeing the API clarified some things.
I was initially more in favor of MV being a new metadata type
(TableMetadata + ViewMetadata). But seeing most of the MV
operations end up being ViewCatalog or Catalog operations, I am
starting to think API-wise that it may not align with the new
metadata type (unless we define an MVCatalog and MV REST endpoints,
which would then be boilerplate wrappers).
Initially, one question I had for the option 'a view and a separate
table' was how to make this table reference (a metadata.json pointer or a
catalog reference). In the previous option, we had a precedent of
Catalog references to Metadata, but not pointers between
Metadatas. I initially saw the proposed Catalog's TableIdentifier
pointer as 'polluting' catalog concerns in ViewMetadata. (I saw
Catalog and ViewCatalog as a layer above TableMetadata and
ViewMetadata). But I think Dan in the Slack made a fair point
that ViewMetadata is already tightly bound with a Catalog. In
this case, I think this approach does have its merits as well in
aligning the Catalog APIs with the metadata.
Thanks
Szehon
On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi all,
I would like to provide my perspective on the question of what
a materialized view is and elaborate on Jack's recent proposal
to view a materialized view as a catalog concept.
Firstly, let's look at the role of the catalog. Every entity
in the catalog has a *unique identifier*, and the catalog
provides methods to create, load, and update these entities.
An important thing to note is that the catalog methods exhibit
two different behaviors: the *create and load methods deal
with the entire entity*, while the *update(commit) method only
deals with partial changes* to the entities.
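In API terms, a rough sketch using the existing table and view APIs:

// create/load deal with the entire entity:
Table table = catalog.loadTable(TableIdentifier.of("db1", "t1"));
View view = viewCatalog.loadView(TableIdentifier.of("db1", "v1"));

// update(commit) only carries partial changes:
table.updateProperties().set("some-key", "some-value").commit();
view.updateProperties().set("some-key", "some-value").commit();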
In the context of our current discussion, materialized view
(MV) metadata is a union of view and table metadata. The fact
that the update method deals only with partial changes enables us
to *reuse the existing methods for updating tables and
views*. For updates, we don't have to define what
constitutes an entire materialized view. Changes to a
materialized view targeting the properties related to the view
metadata could use the update(commit) view method. Similarly,
changes targeting the properties related to the table metadata
could use the update(commit) table method. This is great news
because we don't have to redefine view and table commits
(requirements, updates).
This is shown in the fact that Jack uses the same operation to
update the storage table for Options 1 and 3:
// REST: POST /namespaces/db1/tables/mv1?materializedView=true
// non-REST: update JSON files at table_metadata_location
storageTable.newAppend().appendFile(...).commit();
The open question is *whether the create and load methods
should treat the properties that constitute the MV metadata as
two entities (View + Table) or one entity (new MV object)*.
This is all part of Jack's proposal, where Option 1 proposes a
new MV object, and Option 3 proposes two separate entities.
The advantage of Option 1 is that it doesn't require two
operations to load the metadata. On the other hand, the
advantage of Option 3 is that no new operations or catalogs
have to be defined.
In my opinion, defining a new representation for materialized
views (Option 1) is generally the cleaner solution. However, I
see a path where we could first introduce Option 3 and still
have the possibility to transition to Option 1 if needed. The
great thing about Option 3 is that it only requires minor
changes to the current spec and is mostly an implementation detail.
Therefore, I would propose small additions to Jack's Option 3
that only introduce changes to the spec that are not specific
to materialized views. The idea is to introduce boolean
properties to be set on the creation of the view and the
storage table that indicate that they belong to a materialized
view. The view property "materialized" is set to "true" for a
MV and "false" for a regular view. And the table property
"storage_table" is set to "true" for a storage table and
"false" for a regular table. The absence of these properties
indicates a regular view or table.
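As a rough sketch (assuming the same builder APIs we already have for
regular views and tables, and an illustrative name for the storage
table), creating such a pair could look like:

// assuming `catalog` is an existing Catalog that also implements ViewCatalog
ViewCatalog viewCatalog = (ViewCatalog) catalog;

Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.required(2, "cnt", Types.LongType.get()));

// create the view side of the MV, flagged with the proposed property
View mv = viewCatalog.buildView(TableIdentifier.of("db1", "mv1"))
    .withSchema(schema)
    .withDefaultNamespace(Namespace.of("db1"))
    .withQuery("spark", "SELECT id, count(*) AS cnt FROM db1.base GROUP BY id")
    .withProperty("materialized", "true")
    .create();

// create the storage table, flagged with the proposed property
// (the storage table identifier here is only illustrative)
Table storageTable = catalog.createTable(
    TableIdentifier.of("db1", "mv1_storage"),
    schema,
    PartitionSpec.unpartitioned(),
    Map.of("storage_table", "true"));

Loading and updating would then reuse the existing APIs unchanged: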
ViewCatalog viewCatalog = (ViewCatalog) catalog;
// REST: GET /namespaces/db1/views/mv1
// non-REST: load JSON file at metadata_location
View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
// REST: GET /namespaces/db1/tables/mv1
// non-REST: load JSON file at table_metadata_location if present
Table storageTable = mv.storageTable();
// REST: POST /namespaces/db1/tables/mv1
// non-REST: update JSON file at table_metadata_location
storageTable.newAppend().appendFile(...).commit();
We could then introduce a new requirement for views and tables
called "AssertProperty" which could make sure to only perform
updates that are in line with materialized views. The
additional requirement can be seen as a general extension
which does not need to be changed if we decide to go with
Option 1 in the future.
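As a sketch of the idea (not tied to any existing requirement class),
the check itself would be very small:

// hypothetical "AssertProperty" requirement: fail the commit unless the
// current metadata still carries the expected property value,
// e.g. key = "materialized", expected = "true"
static void assertProperty(Map<String, String> currentProperties, String key, String expected) {
  String actual = currentProperties.get(key);
  if (!expected.equals(actual)) {
    throw new IllegalStateException(
        "Requirement failed: expected " + key + "=" + expected + ", found " + actual);
  }
}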
Let me know what you think.
Best wishes,
Jan
On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
Thanks Ryan for the insights. I agree that reusing existing
metadata definitions and minimizing spec changes are very
important. This also minimizes spec drift (between
materialized views and views spec, and between materialized
views and tables spec), and simplifies the implementation.
In an effort to take the discussion forward with concrete
design options based on an end-to-end implementation, I have
prototyped the implementation (and added Spark support) in
this PR https://github.com/apache/iceberg/pull/9830. I hope
it helps us reach convergence faster. More details about some
of the design options are discussed in the description of the
PR.
Thanks,
Walaa.
On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io>
wrote:
I mean separate table and view metadata that is somehow
combined through a commit process. For instance, keeping
a pointer to a table metadata file in a view metadata
file or combining commits to reference both. I don't see
the value in either option.
On Wed, Feb 28, 2024 at 5:05 PM Jack Ye
<yezhao...@gmail.com> wrote:
Thanks Ryan for the help to trace back to the root
question! Just a clarification question regarding
your reply before I reply further: what exactly does
the option "a combination of the two (i.e. commits
are combined)" mean? How is that different from "a
new metadata type"?
-Jack
On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue
<b...@tabular.io> wrote:
I’m catching up on this conversation, so
hopefully I can bring a fresh perspective.
Jack already pointed out that we need to start
from the basics and I agree with that. Let’s
remove voting at this point. Right now is the
time for discussing trade-offs, not lining up and
taking sides. I realize that wasn’t the intent
with adding a vote, but that’s almost always the
result. It’s too easy to use it as a stand-in for
consensus and move on prematurely. I get the
impression from the swirl in Slack that
discussion has moved ahead of agreement.
We’re still at the most basic question: is a
materialized view a view and a separate table, a
combination of the two (i.e. commits are
combined), or a new metadata type?
For now, I’m ignoring whether the “separate
table” is some kind of “system table” (meaning
hidden?) or if it is exposed in the catalog.
That’s a later choice (already pointed out) and,
I suspect, it should be delegated to catalog
implementations.
To simplify this a little, I think that we can
eliminate the option to combine table and view
commits. I don’t think there is a reason to
combine the two. If separate, a table would track
the view version used along with freshness
information for referenced tables. If the table
is automatically skipped when the version no
longer matches the view, then no action needs to
happen when a view definition changes. Similarly,
the table can be updated independently without
needing to also swap view metadata. This also
aligns with the idea from the original doc that
there can be multiple materialization tables for
a view. Each should operate independently unless
I’m missing something.
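As a rough sketch of that check (the property name is illustrative,
not from any spec):

// the storage table records which view version it was written against;
// readers skip it when that no longer matches the current view version
static boolean materializationIsCurrent(View view, Table storageTable) {
  String written = storageTable.properties()
      .getOrDefault("materialization.view-version-id", "-1");
  return Integer.parseInt(written) == view.currentVersion().versionId();
}
// when this returns false, plan the query from the view's SQL instead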
I don’t think the last paragraph’s conclusion is
contentious so I’ll move on, but please stop here
and reply if you disagree!
That leaves the main two options, a view and a
separate table linked by metadata, or, combined
materialized view metadata.
As the doc notes, the separate view and table
option is simpler because it reuses existing
metadata definitions and falls back to simple
views. That is a significantly smaller spec and
small is very, very important when it comes to
specs. I think that the argument for a new
definition of a materialized view needs to
overcome this disadvantage.
The arguments that I see for a combined
materialized view object are:
* Regular views are separate, rather than being
tables with SQL and no data, so it would be
inconsistent (“Iceberg view is just a table
with no data but with representations
defined. But we did not do that.”)
* Materialized views are different objects in DDL
* Tables may be a superset of functionality
needed for materialized views
* Tables are not typically exposed to end users
— but this isn’t required by the separate
view and table option
Am I missing any arguments for combined metadata?
Ryan
--
Ryan Blue
Tabular