Hi all,
I would like to provide my perspective on the question of what a
materialized view is and elaborate on Jack's recent proposal to view a
materialized view as a catalog concept.
Firstly, let's look at the role of the catalog. Every entity in the
catalog has a *unique identifier*, and the catalog provides methods to
create, load, and update these entities. An important thing to note is
that the catalog methods exhibit two different behaviors: the *create
and load methods deal with the entire entity*, while the *update(commit)
method only deals with partial changes* to the entities.
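To make this concrete, here is a rough sketch using today's table API (the view API behaves analogously); assume a Catalog instance "catalog" and a Schema "schema" are in scope:
// create and load both return the complete entity
Table created = catalog.createTable(TableIdentifier.of("db1", "t1"), schema);
Table loaded = catalog.loadTable(TableIdentifier.of("db1", "t1"));
// an update(commit) only carries the partial change, here a single property
loaded.updateProperties().set("comment", "updated via partial commit").commit();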
In the context of our current discussion, materialized view (MV)
metadata is a union of view and table metadata. The fact that the update
method deals only with partial changes enables us to *reuse the
existing methods for updating tables and views*. For updates, we don't
have to define what constitutes an entire materialized view. Changes to
a materialized view targeting the properties related to the view
metadata could use the update(commit) view method. Similarly, changes
targeting the properties related to the table metadata could use the
update(commit) table method. This is great news because we don't have to
redefine view and table commits (requirements, updates).
This is reflected in the fact that Jack uses the same operation to
update the storage table in both Option 1 and Option 3:
// REST: POST /namespaces/db1/tables/mv1?materializedView=true
// non-REST: update JSON files at table_metadata_location
storageTable.newAppend().appendFile(...).commit();
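The same reuse works on the view side; assuming a loaded View "view", a change to the view-related properties of an MV could go through the existing view commit:
// REST: POST /namespaces/db1/views/mv1
// non-REST: update JSON file at metadata_location
view.updateProperties().set("comment", "view-side change").commit();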
The open question is *whether the create and load methods should treat
the properties that constitute the MV metadata as two entities (View +
Table) or one entity (new MV object)*. This is all part of Jack's
proposal, where Option 1 proposes a new MV object, and Option 3 proposes
two separate entities. The advantage of Option 1 is that it doesn't
require two operations to load the metadata. On the other hand, the
advantage of Option 3 is that no new operations or catalogs have to be
defined.
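To illustrate the difference on the load path (the MaterializedView type and loadMaterializedView method below are hypothetical and only meant to show the shape of Option 1):
// Option 1: one new operation returning one combined entity (hypothetical API)
MaterializedView mv = catalog.loadMaterializedView(TableIdentifier.of("db1", "mv1"));

// Option 3: two existing operations, the view first, then its storage table
View view = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
Table storageTable = catalog.loadTable(TableIdentifier.of("db1", "mv1"));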
In my opinion, defining a new representation for materialized views
(Option 1) is generally the cleaner solution. However, I see a path
where we could first introduce Option 3 and still have the possibility
to transition to Option 1 if needed. The great thing about Option 3 is
that it only requires minor changes to the current spec and is mostly
an implementation detail.
Therefore, I would propose small additions to Jack's Option 3 that only
introduce changes to the spec that are not specific to materialized
views. The idea is to introduce boolean properties, set at creation of
the view and the storage table, that indicate that they belong to a
materialized view. The view property "materialized" is set to "true"
for an MV and "false" for a regular view. Similarly, the table property
"storage_table" is set to "true" for a storage table and "false" for a
regular table. The absence of these properties also indicates a regular
view or table.
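On creation this could look roughly as follows (a sketch, assuming the existing ViewBuilder/TableBuilder withProperty methods; viewCatalog is the catalog cast to ViewCatalog as below, and the schema and query are placeholders):
// create the view part and mark it as materialized
viewCatalog.buildView(TableIdentifier.of("db1", "mv1"))
    .withSchema(schema)
    .withDefaultNamespace(Namespace.of("db1"))
    .withQuery("spark", "SELECT ...")
    .withProperty("materialized", "true")
    .create();

// create the storage table and mark it as a storage table
catalog.buildTable(TableIdentifier.of("db1", "mv1"), schema)
    .withProperty("storage_table", "true")
    .create();
Loading and committing then reuse the existing view and table operations unchanged: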
ViewCatalog viewCatalog = (ViewCatalog) catalog;
// REST: GET /namespaces/db1/views/mv1
// non-REST: load JSON file at metadata_location
View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
// REST: GET /namespaces/db1/tables/mv1
// non-REST: load JSON file at table_metadata_location if present
Table storageTable = mv.storageTable();
// REST: POST /namespaces/db1/tables/mv1
// non-REST: update JSON file at table_metadata_location
storageTable.newAppend().appendFile(...).commit();
We could then introduce a new requirement for views and tables called
"AssertProperty" which would make sure that only updates that are in
line with materialized views are performed. The additional requirement
can be seen as a general extension which does not need to be changed if
we decide to go with Option 1 in the future.
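As a rough sketch of what such a requirement could check (the class below is hypothetical and not tied to the existing update-requirement machinery; the property names are the ones proposed above):
// hypothetical sketch: a commit requirement that fails unless the entity
// already carries the expected marker property
class AssertProperty {
  private final String key;
  private final String expectedValue;

  AssertProperty(String key, String expectedValue) {
    this.key = key;
    this.expectedValue = expectedValue;
  }

  void check(Map<String, String> currentProperties) {
    if (!expectedValue.equals(currentProperties.get(key))) {
      throw new CommitFailedException(
          "Requirement failed: expected property %s=%s", key, expectedValue);
    }
  }
}

// a storage table commit issued through the MV code path could then carry
// new AssertProperty("storage_table", "true") so it cannot accidentally be
// applied to a regular table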
Let me know what you think.
Best wishes,
Jan
On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
Thanks Ryan for the insights. I agree that reusing existing metadata
definitions and minimizing spec changes are very important. This also
minimizes spec drift (between materialized views and views spec, and
between materialized views and tables spec), and simplifies the
implementation.
In an effort to take the discussion forward with concrete design
options based on an end-to-end implementation, I have prototyped the
implementation (and added Spark support) in this PR
https://github.com/apache/iceberg/pull/9830. I hope it helps us reach
convergence faster. More details about some of the design options are
discussed in the description of the PR.
Thanks,
Walaa.
On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
I mean separate table and view metadata that is somehow combined
through a commit process. For instance, keeping a pointer to a
table metadata file in a view metadata file or combining commits
to reference both. I don't see the value in either option.
On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote:
Thanks Ryan for the help to trace back to the root question!
Just a clarification question regarding your reply before I
reply further: what exactly does the option "a combination of
the two (i.e. commits are combined)" mean? How is that
different from "a new metadata type"?
-Jack
On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote:
I’m catching up on this conversation, so hopefully I can
bring a fresh perspective.
Jack already pointed out that we need to start from the
basics and I agree with that. Let’s remove voting at this
point. Right now is the time for discussing trade-offs,
not lining up and taking sides. I realize that wasn’t the
intent with adding a vote, but that’s almost always the
result. It’s too easy to use it as a stand-in for
consensus and move on prematurely. I get the impression
from the swirl in Slack that discussion has moved ahead of
agreement.
We’re still at the most basic question: is a materialized
view a view and a separate table, a combination of the two
(i.e. commits are combined), or a new metadata type?
For now, I’m ignoring whether the “separate table” is some
kind of “system table” (meaning hidden?) or if it is
exposed in the catalog. That’s a later choice (already
pointed out) and, I suspect, it should be delegated to
catalog implementations.
To simplify this a little, I think that we can eliminate
the option to combine table and view commits. I don’t
think there is a reason to combine the two. If separate, a
table would track the view version used along with
freshness information for referenced tables. If the table
is automatically skipped when the version no longer
matches the view, then no action needs to happen when a
view definition changes. Similarly, the table can be
updated independently without needing to also swap view
metadata. This also aligns with the idea from the original
doc that there can be multiple materialization tables for
a view. Each should operate independently unless I’m
missing something.
I don’t think the last paragraph’s conclusion is
contentious so I’ll move on, but please stop here and
reply if you disagree!
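To sketch what that skip could look like on the read path (a sketch only; the property name below is an assumed placeholder, not a spec proposal):
// load both entities and compare the view version tracked by the table
View view = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
Table storage = catalog.loadTable(TableIdentifier.of("db1", "mv1"));

int materializedVersionId =
    Integer.parseInt(storage.properties().getOrDefault("view-version-id", "-1"));

if (materializedVersionId == view.currentVersion().versionId()) {
  // versions match: scan the precomputed data in the storage table
} else {
  // stale: fall back to the view SQL; no view or table commit is needed
}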
That leaves the main two options, a view and a separate
table linked by metadata, or, combined materialized view
metadata.
As the doc notes, the separate view and table option is
simpler because it reuses existing metadata definitions
and falls back to simple views. That is a significantly
smaller spec and small is very, very important when it
comes to specs. I think that the argument for a new
definition of a materialized view needs to overcome this
disadvantage.
The arguments that I see for a combined materialized view
object are:
* Regular views are separate, rather than being tables
with SQL and no data, so it would be inconsistent
(“Iceberg view is just a table with no data but with
representations defined. But we did not do that.”)
* Materialized views are different objects in DDL
* Tables may be a superset of functionality needed for
materialized views
* Tables are not typically exposed to end users — but
this isn’t required by the separate view and table option
Am I missing any arguments for combined metadata?
Ryan
--
Ryan Blue
Tabular
--
Ryan Blue
Tabular