Re: Materialized view integration with REST spec

Szehon Ho Thu, 29 Feb 2024 12:26:48 -0800

Hi

Yes, I mostly agree with the assessment.  To clarify a few minor points:


> is a materialized view a view and a separate table, a combination of the
> two (i.e. commits are combined), or a new metadata type?


For 'new metadata type', I consider mostly Jack's initial proposal of a new
Catalog MV object that has two references (ViewMetadata + TableMetadata).

> The arguments that I see for a combined materialized view object are:
>
>    - Regular views are separate, rather than being tables with SQL and no
>    data so it would be inconsistent (“Iceberg view is just a table with no
>    data but with representations defined. But we did not do that.”)
>
>
>    - Materialized views are different objects in DDL
>
>
>    - Tables may be a superset of functionality needed for materialized
>    views
>
>
>    - Tables are not typically exposed to end users — but this isn’t
>    required by the separate view and table option
>
For completeness, there seem to be a few additional ones (mentioned in
Slack and in the messages above).

   - Lack of spec change (to ViewMetadata).  But as Jack says, it is a spec
   change (i.e., to catalogs)
   - A single call to get the View's StorageTable (versus two calls)
   - A more natural API, with no opportunity for a user to call
   Catalog.dropTable() and renameTable() on the storage table


*Thoughts:* I think the long discussion sessions we had on Slack
were fruitful for me, as seeing the API clarified some things.

I was initially more in favor of MV being a new metadata type
(TableMetadata + ViewMetadata).  But seeing most of the MV operations end
up being ViewCatalog or Catalog operations, I am starting to think API-wise
that it may not align with the new metadata type (unless we define
MVCatalog and /MV REST endpoints, which then are boilerplate wrappers).

Initially, one question I had about the option 'a view and a separate table'
was how to make this table reference (metadata.json or catalog reference).  In
the previous option, we had a precedent of Catalog references to Metadata,
but not of pointers between Metadata objects.  I initially saw the proposed
TableIdentifier pointer as 'polluting' ViewMetadata with catalog concerns.
(I saw Catalog and ViewCatalog as a layer above TableMetadata and
ViewMetadata.)  But I think Dan made a fair point in Slack that
ViewMetadata is already tightly bound to a Catalog.  In this case, I
think this approach has its merits as well, in aligning the Catalog APIs
with the metadata.
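
To make the 'a view and a separate table' shape concrete, here is a minimal,
self-contained Java sketch.  It does not use the Iceberg API: ToyCatalog,
ViewEntry, TableEntry and the storageTable field are hypothetical names that
stand in for a view whose metadata carries a catalog identifier for its
storage table.  It only illustrates the two catalog calls needed to reach the
storage table, and the fallback to a plain view when no storage table is
referenced.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: none of these types are Iceberg classes.
final class SeparateViewAndTableSketch {

    // A view; storageTable is the catalog identifier of its storage table,
    // or null for a regular view. The field is an assumption for illustration.
    static final class ViewEntry {
        final String sql;
        final String storageTable;

        ViewEntry(String sql, String storageTable) {
            this.sql = sql;
            this.storageTable = storageTable;
        }
    }

    // A table, reduced here to the view version it was materialized from.
    static final class TableEntry {
        final int viewVersion;

        TableEntry(int viewVersion) {
            this.viewVersion = viewVersion;
        }
    }

    // In-memory stand-in for a catalog that serves both views and tables.
    static final class ToyCatalog {
        final Map<String, ViewEntry> views = new HashMap<>();
        final Map<String, TableEntry> tables = new HashMap<>();
        int lookups = 0; // number of catalog calls made so far

        Optional<ViewEntry> loadView(String id) {
            lookups++;
            return Optional.ofNullable(views.get(id));
        }

        Optional<TableEntry> loadTable(String id) {
            lookups++;
            return Optional.ofNullable(tables.get(id));
        }

        // Two-step resolution: load the view, then follow its identifier.
        // A view without a storage table yields an empty result.
        Optional<TableEntry> loadStorageTable(String viewId) {
            return loadView(viewId)
                .map(view -> view.storageTable)
                .flatMap(this::loadTable);
        }
    }

    public static void main(String[] args) {
        ToyCatalog catalog = new ToyCatalog();
        // The view and its storage table share the name, as in the quoted example.
        catalog.views.put("db1.mv1", new ViewEntry("SELECT ...", "db1.mv1"));
        catalog.views.put("db1.v1", new ViewEntry("SELECT ...", null));
        catalog.tables.put("db1.mv1", new TableEntry(1));

        System.out.println(catalog.loadStorageTable("db1.mv1").isPresent());
        System.out.println(catalog.loadStorageTable("db1.v1").isPresent());
        System.out.println("catalog calls: " + catalog.lookups);
    }
}
```

In this toy model the materialized view costs two catalog calls and the
regular view costs one, which is the trade-off against a single combined MV
object.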

Thanks
Szehon



On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> Hi all,
>
> I would like to provide my perspective on the question of what a
> materialized view is and elaborate on Jack's recent proposal to view a
> materialized view as a catalog concept.
>
> Firstly, let's look at the role of the catalog. Every entity in the
> catalog has a *unique identifier*, and the catalog provides methods to
> create, load, and update these entities. An important thing to note is that
> the catalog methods exhibit two different behaviors: the *create and load
> methods deal with the entire entity*, while the *update(commit) method
> only deals with partial changes* to the entities.
>
> In the context of our current discussion, materialized view (MV) metadata
> is a union of view and table metadata. The fact that the update method
> deals only with partial changes enables us to *reuse the existing
> methods for updating tables and views*. For updates we don't have to
> define what constitutes an entire materialized view. Changes to a
> materialized view targeting the properties related to the view metadata
> could use the update(commit) view method. Similarly, changes targeting the
> properties related to the table metadata could use the update(commit) table
> method. This is great news because we don't have to redefine view and table
> commits (requirements, updates).
> This is shown in the fact that Jack uses the same operation to update the
> storage table for Option 1 and 3:
>
> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
> // non-REST: update JSON files at table_metadata_location
> storageTable.newAppend().appendFile(...).commit();
>
> The open question is *whether the create and load methods should treat
> the properties that constitute the MV metadata as two entities (View +
> Table) or one entity (new MV object)*. This is all part of Jack's
> proposal, where Option 1 proposes a new MV object, and Option 3 proposes
> two separate entities. The advantage of Option 1 is that it doesn't require
> two operations to load the metadata. On the other hand, the advantage of
> Option 3 is that no new operations or catalogs have to be defined.
>
> In my opinion, defining a new representation for materialized views
> (Option 1) is generally the cleaner solution. However, I see a path where
> we could first introduce Option 3 and still have the possibility to
> transition to Option 1 if needed. The great thing about Option 3 is that it
> only requires minor changes to the current spec and is mostly
> implementation detail.
>
> Therefore I would propose small additions to Jack's Option 3 that only
> introduce changes to the spec that are not specific to materialized views.
> The idea is to introduce boolean properties to be set on the creation of
> the view and the storage table that indicate that they belong to a
> materialized view. The view property "materialized" is set to "true" for a
> MV and "false" for a regular view. And the table property "storage_table"
> is set to "true" for a storage table and "false" for a regular table. The
> absence of these properties indicates a regular view or table.
>
> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>
> // REST: GET /namespaces/db1/views/mv1
> // non-REST: load JSON file at metadata_location
> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>
> // REST: GET /namespaces/db1/tables/mv1
> // non-REST: load JSON file at table_metadata_location if present
> Table storageTable = mv.storageTable();
>
> // REST: POST /namespaces/db1/tables/mv1
> // non-REST: update JSON file at table_metadata_location
> storageTable.newAppend().appendFile(...).commit();
>
> We could then introduce a new requirement for views and tables called
> "AssertProperty" which could make sure to only perform updates that are
> in line with materialized views. The additional requirement can be seen as a
> general extension which does not need to be changed if we decide to go
> with Option 1 in the future.
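
As a rough illustration of the property flags and the proposed requirement,
here is a minimal, self-contained Java sketch.  It does not use the Iceberg
API; AssertProperty exists only as a proposal in this thread, and the class
shape, the flag() helper and the exception type below are assumptions.  The
only parts taken from the proposal are the "storage_table" and "materialized"
property names and the rule that an absent property means a regular table or
view.

```java
import java.util.Map;

// Toy model only: AssertProperty is a proposed idea, not an existing Iceberg class.
final class AssertPropertySketch {

    // Reads a boolean flag; an absent key counts as false,
    // so a missing flag means a regular view or table.
    static boolean flag(Map<String, String> properties, String key) {
        return Boolean.parseBoolean(properties.getOrDefault(key, "false"));
    }

    // Hypothetical commit requirement: the flag must have the expected value.
    static final class AssertProperty {
        final String key;
        final boolean expected;

        AssertProperty(String key, boolean expected) {
            this.key = key;
            this.expected = expected;
        }

        // Throws if the current properties do not satisfy the requirement.
        void validate(Map<String, String> currentProperties) {
            if (flag(currentProperties, key) != expected) {
                throw new IllegalStateException(
                    "Requirement failed: expected " + key + "=" + expected);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> storageTable = Map.of("storage_table", "true");
        Map<String, String> regularTable = Map.of();

        AssertProperty requirement = new AssertProperty("storage_table", true);
        requirement.validate(storageTable); // satisfied, no exception
        try {
            requirement.validate(regularTable);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the check only looks at a generic property, the same requirement
could guard view commits via the "materialized" flag without being specific
to materialized views.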
>
> Let me know what you think.
>
> Best wishes,
>
> Jan
>
> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>
> Thanks Ryan for the insights. I agree that reusing existing metadata
> definitions and minimizing spec changes are very important. This also
> minimizes spec drift (between materialized views and views spec, and
> between materialized views and tables spec), and simplifies the
> implementation.
>
> In an effort to take the discussion forward with concrete design options
> based on an end-to-end implementation, I have prototyped the
> implementation (and added Spark support) in this PR
> https://github.com/apache/iceberg/pull/9830. I hope it helps us reach
> convergence faster. More details about some of the design options are
> discussed in the description of the PR.
>
> Thanks,
> Walaa.
>
>
> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I mean separate table and view metadata that is somehow combined through
>> a commit process. For instance, keeping a pointer to a table metadata file
>> in a view metadata file or combining commits to reference both. I don't see
>> the value in either option.
>>
>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Thanks Ryan for the help to trace back to the root question! Just a
>>> clarification question regarding your reply before I reply further: what
>>> exactly does the option "a combination of the two (i.e. commits are
>>> combined)" mean? How is that different from "a new metadata type"?
>>>
>>> -Jack
>>>
>>>
>>>
>>>
>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I’m catching up on this conversation, so hopefully I can bring a fresh
>>>> perspective.
>>>>
>>>> Jack already pointed out that we need to start from the basics and I
>>>> agree with that. Let’s remove voting at this point. Right now is the time
>>>> for discussing trade-offs, not lining up and taking sides. I realize that
>>>> wasn’t the intent with adding a vote, but that’s almost always the result.
>>>> It’s too easy to use it as a stand-in for consensus and move on
>>>> prematurely. I get the impression from the swirl in Slack that discussion
>>>> has moved ahead of agreement.
>>>>
>>>> We’re still at the most basic question: is a materialized view a view
>>>> and a separate table, a combination of the two (i.e. commits are combined),
>>>> or a new metadata type?
>>>>
>>>> For now, I’m ignoring whether the “separate table” is some kind of
>>>> “system table” (meaning hidden?) or if it is exposed in the catalog. That’s
>>>> a later choice (already pointed out) and, I suspect, it should be delegated
>>>> to catalog implementations.
>>>>
>>>> To simplify this a little, I think that we can eliminate the option to
>>>> combine table and view commits. I don’t think there is a reason to combine
>>>> the two. If separate, a table would track the view version used along with
>>>> freshness information for referenced tables. If the table is automatically
>>>> skipped when the version no longer matches the view, then no action needs
>>>> to happen when a view definition changes. Similarly, the table can be
>>>> updated independently without needing to also swap view metadata. This also
>>>> aligns with the idea from the original doc that there can be multiple
>>>> materialization tables for a view. Each should operate independently unless
>>>> I’m missing something.
>>>>
>>>> I don’t think the last paragraph’s conclusion is contentious so I’ll
>>>> move on, but please stop here and reply if you disagree!
>>>>
>>>> That leaves the main two options: a view and a separate table linked by
>>>> metadata, or combined materialized view metadata.
>>>>
>>>> As the doc notes, the separate view and table option is simpler because
>>>> it reuses existing metadata definitions and falls back to simple views.
>>>> That is a significantly smaller spec and small is very, very important when
>>>> it comes to specs. I think that the argument for a new definition of a
>>>> materialized view needs to overcome this disadvantage.
>>>>
>>>> The arguments that I see for a combined materialized view object are:
>>>>
>>>>    - Regular views are separate, rather than being tables with SQL and
>>>>    no data so it would be inconsistent (“Iceberg view is just a table with
>>>>    no data but with representations defined. But we did not do that.”)
>>>>    - Materialized views are different objects in DDL
>>>>    - Tables may be a superset of functionality needed for materialized
>>>>    views
>>>>    - Tables are not typically exposed to end users — but this isn’t
>>>>    required by the separate view and table option
>>>>
>>>> Am I missing any arguments for combined metadata?
>>>>
>>>> Ryan
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
