Re: Materialized view integration with REST spec

Daniel Weeks Fri, 01 Mar 2024 08:56:52 -0800

I feel I've been most vocal about pushing back against options 2+ (or
Ryan's categories of combined table/view, or new metadata type), so I'll
try to expand on my reasoning.


I understand the appeal of creating a design where we encapsulate the
view/storage from both a structural and performance standpoint, but I don't
think that is necessary and it significantly increases the complexity.

All of these approaches are aligned in one, specific way: the storage table
is an iceberg table.

Because of this, all the behaviors and requirements of tables still apply
to them.  They need to be maintained (snapshot cleanup, orphan file
removal), in some cases they need to be optimized (compaction, manifest
rewrites), and they need to be inspectable (this will be even more
important with MVs, since staleness can produce different results and
questions will arise about what state the storage table was in).  There may
also be cases where the tables need to be managed directly.

Anywhere we deviate from the existing constructs/commit/access for tables,
we will ultimately have to unwrap that again to re-expose the underlying
Iceberg behavior.  This creates unnecessary complexity in the library/API
layer, which is not the primary interface users will have with materialized
views, since an engine is almost always necessary to interact with the
dataset.

As to the performance concerns around option 1, I think we're overstating
the downsides.  It really comes down to how many metadata loads are
necessary, and evaluating freshness would likely be the real bottleneck, as
it potentially involves loading many tables.  All of the options are on the
same order of performance for the metadata and table loads.
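
To illustrate the freshness point, here is a minimal sketch in plain Java.
It is not the Iceberg API: CountingCatalog, loadCurrentSnapshotId and
FreshnessCheck are hypothetical names, and it assumes the storage table
records, for each source table, the snapshot id that was current at the
last refresh. Under that assumption, the number of metadata loads for a
freshness check grows with the number of source tables, regardless of how
the view and its storage table are linked.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a catalog that counts metadata loads (not the Iceberg API).
final class CountingCatalog {
    private final Map<String, Long> currentSnapshotIds = new HashMap<>();
    private int loads = 0;

    // Registers a table, or advances its current snapshot id.
    void put(String table, long snapshotId) {
        currentSnapshotIds.put(table, snapshotId);
    }

    // Models one metadata load: returns the table's current snapshot id.
    long loadCurrentSnapshotId(String table) {
        loads++;
        Long id = currentSnapshotIds.get(table);
        if (id == null) {
            throw new IllegalArgumentException("Unknown table: " + table);
        }
        return id;
    }

    int loads() {
        return loads;
    }
}

final class FreshnessCheck {
    private FreshnessCheck() {}

    // refreshState maps each source table to the snapshot id seen at the last refresh
    // (assumed to be stored with the storage table). Stops at the first stale source.
    static boolean isFresh(CountingCatalog catalog, Map<String, Long> refreshState) {
        for (Map.Entry<String, Long> source : refreshState.entrySet()) {
            long current = catalog.loadCurrentSnapshotId(source.getKey());
            if (current != source.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

With two source tables that are both unchanged, one freshness check costs
two loads on top of whatever it took to reach the storage table.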

As to the visibility of tables and whether they're registered in the
catalog, I think registering in the catalog is the right approach so that
the tables are still addressable for maintenance/etc.  The visibility of
the storage table is a catalog implementation decision and shouldn't be a
requirement of the MV spec (I can see cases for both and it isn't necessary
to dictate a behavior).
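
As a rough sketch of why visibility can be left to the catalog: the toy
catalog below (plain Java, not the Iceberg API; SimpleCatalog and its
methods are hypothetical) takes a flag that decides whether storage tables
show up in listings, while loading by identifier works either way, so the
table stays addressable for maintenance. It borrows the "storage_table"
property that Jan suggests further down the thread purely as an assumed
marker.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy catalog, hypothetical and not the Iceberg API: table identifier -> properties.
final class SimpleCatalog {
    private final Map<String, Map<String, String>> tables = new LinkedHashMap<>();
    private final boolean hideStorageTables;

    SimpleCatalog(boolean hideStorageTables) {
        this.hideStorageTables = hideStorageTables;
    }

    void register(String identifier, Map<String, String> properties) {
        tables.put(identifier, properties);
    }

    // Listing honors the catalog's own visibility policy.
    List<String> listTables() {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> table : tables.entrySet()) {
            boolean isStorageTable = "true".equals(table.getValue().get("storage_table"));
            if (!(hideStorageTables && isStorageTable)) {
                visible.add(table.getKey());
            }
        }
        return visible;
    }

    // Direct loads ignore the visibility policy, so maintenance can still reach the table.
    Map<String, String> loadTable(String identifier) {
        Map<String, String> properties = tables.get(identifier);
        if (properties == null) {
            throw new IllegalArgumentException("No such table: " + identifier);
        }
        return properties;
    }
}
```

Both policies satisfy the same contract for loads; only the listing differs,
which is why it does not need to be dictated by the MV spec.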

I'm still strongly in favor of Option 1 (separate table and view) for these
reasons.

-Dan



On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <[email protected]> wrote:

> > Jack, it sounds like you’re the proponent of a combined table and view
> > (rather than a new metadata spec for a materialized view). What is the main
> > motivation? It seems like you’re convinced of that approach, but I don’t
> > understand the advantage it brings.
>
> Sorry, I had to make a Google Sheet to capture all the options we have
> discussed so far. I wanted to use the existing Google Doc, but it has
> really bad table/sheet support...
>
>
> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>
> I have listed all the options, with how they are implemented and some
> important considerations we have discussed so far. Note that:
> 1. This sheet currently excludes the lineage information, which we can
> discuss more later after the current topic is resolved.
> 2. I removed the considerations for REST integration since from the other
> thread we have clarified that they should be considered completely
> separately.
>
> *Why I come as a proponent of having a new MV object with table and view
> metadata file pointer*
>
> In my sheet, there are 3 options that do not have major problems:
> Option 2: Add storage table metadata file pointer in view object
> Option 5: New MV object with table and view metadata file pointer
> Option 6: New MV spec with table and view metadata
>
> I originally excluded option 2 because I thought it did not align with the
> REST spec, but after the other discussion thread about "Inconsistency
> between REST spec and table/view spec", I think my original concern no
> longer holds, so I have put it back. Given my personal preference that an
> MV is an independent object that should be separate from view and table,
> plus the fact that option 5 is probably less implementation work than
> option 6, that is how I came to be a proponent of option 5 at this
> moment.
>
>
> *Regarding Ryan's evaluation framework*
>
> I think we need to reconcile this sheet with Ryan's evaluation framework.
> That framework categorization puts option 2, 3, 4, 5, 6 all under the same
> category of "A combination of a view and a table" and concludes that they
> don't have any advantage for the same set of reasons. But those reasons are
> not really convincing to me so let's talk about them in more detail.
>
> (1) You said "I don’t see a reason why a combined view and table is
> advantageous" as "this would cause unnecessary dependence between the view
> and table in catalogs."  What dependency exactly do you mean here? And why
> is that unnecessary, given there has to be some sort of dependency anyway
> unless we go with option 5 or 6?
>
> (2) You said "I guess there’s an argument that you could load both table
> and view metadata locations at the same time. That hardly seems worth the
> trouble". I disagree with that. Catalog interaction performance is critical
> to at least everyone working in EMR and Athena, and MV itself as an
> acceleration approach needs to be as fast as possible.
>
> I have put 3 key operations in the doc that I think matter for MV during
> interactions with an engine:
> 1. refresh the storage table
> 2. get the storage table of the MV
> 3. if stale, get the view SQL
>
> Option 1 clearly falls short here, with 4 sequential steps required to load
> a storage table. You mentioned "recent issues with adding views to the JDBC
> catalog" on this topic; could you explain a bit more?
>
> (3) You said "I also think that once we decide on structure, we can make
> it possible for REST catalog implementations to do smart things, in a way
> that doesn’t put additional requirements on the underlying catalog store."
> If REST is fully compatible with Iceberg spec then I have no problem with
> this statement. However, as we discussed in the other thread, it is not the
> case. In the current state, I think the sequence of action should be to
> evolve the Iceberg table/view spec (or add a MV spec) first, and then think
> about how REST can incorporate it or do smart things that are not Iceberg
> spec compliant. Do you agree with that?
>
> (4) You said the table identifier pointer "is a problem we need to solve
> generally because a materialized table needs to be able to track the
> upstream state of tables that were used". I don't think that is a reason to
> choose to use a table identifier pointer for a storage table. The issue is
> not about using a table identifier pointer. It is about exposing the
> storage table as a separate entity in the catalog, which is what people do
> not like and is already discussed at length in Jan's question 3 (also
> linked in the sheet). I agree with that statement, because without a REST
> implementation that can magically hide the storage table, this model adds
> a burden regarding compliance and data governance for any non-REST
> catalog implementation that is compliant with the Iceberg spec. Many
> mechanisms need to be built in a catalog to hide, protect, maintain, and
> recycle the storage table, all of which can be avoided by using other
> approaches. I think we should reach a consensus about that and discuss
> further if you do not agree.
>
> Best,
> Jack Ye
>
> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <[email protected]>
> wrote:
>
>> Hi Ryan, we actually discussed your categories in this question
>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>> Where your categories correspond to the following designs:
>>
>>    - Separate table and view => Design 1
>>    - Combination of view and table => Design 2
>>    - A new metadata type => Design 4
>>
>> Jan
>> On 01.03.24 00:03, Ryan Blue wrote:
>>
>> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll be
>> more specific:
>>
>>    - *Separate table and view*: this option is to have the objects that
>>    we have today, with extra metadata. Commit processes are separate:
>>    committing to the table doesn’t alter the view and committing to the view
>>    doesn’t change the table. However, changing the view can make it so the
>>    table is no longer useful as a materialization.
>>    - *A combination of a view and a table*: in this option, the table
>>    metadata and view metadata are the same as the first option. The
>>    difference is that the commit process combines them, either by embedding
>>    a table metadata location in view metadata or by tracking both in the
>>    same catalog reference.
>>    - *A new metadata type*: this option is where we define a new
>>    metadata object that has view attributes, like SQL representations, along
>>    with table attributes, like partition specs and snapshots.
>>
>> Hopefully this is clear because I think much of the confusion is caused
>> by different definitions.
>>
>> > The LoadTableResponse having optional metadata-location field implies
>> > that the object in the catalog no longer needs to hold a metadata file
>> > pointer
>>
>> The REST protocol has not removed the requirement for a metadata file, so
>> I’m going to keep focused on the MV design options.
>>
>> > When we say a MV can be a “new metadata type”, it does not mean it needs
>> > to define a completely brand new structure of the metadata content
>>
>> I’m making a distinction between separate metadata files for the table
>> and the view and a combined metadata object, as above.
>>
>> > We can define an “Iceberg MV” to be an object in a catalog, which has 1
>> > table metadata file pointer, and 1 view metadata file pointer
>>
>> This is the option I am referring to as a “combination of a view and a
>> table”.
>>
>> So to review my initial email, I don’t see a reason why a combined view
>> and table is advantageous, either implemented by having a catalog reference
>> with two metadata locations or embedding a table metadata location in view
>> metadata. This would cause unnecessary dependence between the view and
>> table in catalogs. I guess there’s an argument that you could load both
>> table and view metadata locations at the same time. That hardly seems worth
>> the trouble given the recent issues with adding views to the JDBC catalog.
>>
>> I also think that once we decide on structure, we can make it possible
>> for REST catalog implementations to do smart things, in a way that doesn’t
>> put additional requirements on the underlying catalog store. For instance,
>> we could specify how to send additional objects in a LoadViewResult, in
>> case the catalog wants to pre-fetch table metadata. I think these
>> optimizations are a later addition, after we define the relationship
>> between views and tables.
>>
>> Jack, it sounds like you’re the proponent of a combined table and view
>> (rather than a new metadata spec for a materialized view). What is the main
>> motivation? It seems like you’re convinced of that approach, but I don’t
>> understand the advantage it brings.
>>
>> Ryan
>>
>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <[email protected]>
>> wrote:
>>
>>> Hi
>>>
>>> Yes I mostly agree with the assessment.  To clarify a few minor points.
>>>
>>>> is a materialized view a view and a separate table, a combination of the
>>>> two (i.e. commits are combined), or a new metadata type?
>>>
>>>
>>> For 'new metadata type', I consider mostly Jack's initial proposal of a
>>> new Catalog MV object that has two references (ViewMetadata +
>>> TableMetadata).
>>>
>>>> The arguments that I see for a combined materialized view object are:
>>>>
>>>>    - Regular views are separate, rather than being tables with SQL and
>>>>    no data so it would be inconsistent (“Iceberg view is just a table
>>>>    with no data but with representations defined. But we did not do
>>>>    that.”)
>>>>
>>>>
>>>>    - Materialized views are different objects in DDL
>>>>
>>>>
>>>>    - Tables may be a superset of functionality needed for materialized
>>>>    views
>>>>
>>>>
>>>>    - Tables are not typically exposed to end users — but this isn’t
>>>>    required by the separate view and table option
>>>>
>>> For completeness, there seem to be a few additional ones (mentioned in
>>> the Slack and above messages).
>>>
>>>    - Lack of spec change (to ViewMetadata).  But as Jack says, it is a
>>>    spec change (i.e., to catalogs)
>>>    - A single call to get the View's StorageTable (versus two calls)
>>>    - A more natural API, with no opportunity for a user to call
>>>    Catalog.dropTable() and renameTable() on the storage table
>>>
>>>
>>> *Thoughts:* I think the long discussion sessions we had on Slack
>>> were fruitful for me, as seeing the API clarified some things.
>>>
>>> I was initially more in favor of MV being a new metadata type
>>> (TableMetadata + ViewMetadata).  But seeing most of the MV operations end
>>> up being ViewCatalog or Catalog operations, I am starting to think API-wise
>>> that it may not align with the new metadata type (unless we define
>>> MVCatalog and /MV REST endpoints, which then are boilerplate wrappers).
>>>
>>> Initially, one question I had for the option 'a view and a separate table'
>>> was how to make this table reference (metadata.json or catalog reference).
>>> In the previous option, we had a precedent of Catalog references to
>>> Metadata, but not pointers between Metadatas.  I initially saw the proposed
>>> Catalog's TableIdentifier pointer as 'polluting' ViewMetadata with catalog
>>> concerns.  (I saw Catalog and ViewCatalog as a layer above
>>> TableMetadata and ViewMetadata).  But I think Dan in the Slack made a fair
>>> point that ViewMetadata already is tightly bound with a Catalog.  In this
>>> case, I think this approach does have its merits as well in aligning
>>> Catalog APIs with the metadata.
>>>
>>> Thanks
>>> Szehon
>>>
>>>
>>>
>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to provide my perspective on the question of what a
>>>> materialized view is and elaborate on Jack's recent proposal to view a
>>>> materialized view as a catalog concept.
>>>>
>>>> Firstly, let's look at the role of the catalog. Every entity in the
>>>> catalog has a *unique identifier*, and the catalog provides methods to
>>>> create, load, and update these entities. An important thing to note is that
>>>> the catalog methods exhibit two different behaviors: the *create and
>>>> load methods deal with the entire entity*, while the *update(commit)
>>>> method only deals with partial changes* to the entities.
>>>>
>>>> In the context of our current discussion, materialized view (MV)
>>>> metadata is a union of view and table metadata. The fact that the update
>>>> method deals only with partial changes enables us to *reuse the
>>>> existing methods for updating tables and views*. For updates we don't
>>>> have to define what constitutes an entire materialized view. Changes to a
>>>> materialized view targeting the properties related to the view metadata
>>>> could use the update(commit) view method. Similarly, changes targeting the
>>>> properties related to the table metadata could use the update(commit) table
>>>> method. This is great news because we don't have to redefine view and table
>>>> commits (requirements, updates).
>>>> This is shown in the fact that Jack uses the same operation to update
>>>> the storage table for Option 1 and 3:
>>>>
>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>> // non-REST: update JSON files at table_metadata_location
>>>> storageTable.newAppend().appendFile(...).commit();
>>>>
>>>> The open question is *whether the create and load methods should treat
>>>> the properties that constitute the MV metadata as two entities (View +
>>>> Table) or one entity (new MV object)*. This is all part of Jack's
>>>> proposal, where Option 1 proposes a new MV object, and Option 3 proposes
>>>> two separate entities. The advantage of Option 1 is that it doesn't require
>>>> two operations to load the metadata. On the other hand, the advantage of
>>>> Option 3 is that no new operations or catalogs have to be defined.
>>>>
>>>> In my opinion, defining a new representation for materialized views
>>>> (Option 1) is generally the cleaner solution. However, I see a path where
>>>> we could first introduce Option 3 and still have the possibility to
>>>> transition to Option 1 if needed. The great thing about Option 3 is that it
>>>> only requires minor changes to the current spec and is mostly
>>>> implementation detail.
>>>>
>>>> Therefore I would propose small additions to Jack's Option 3 that only
>>>> introduce changes to the spec that are not specific to materialized views.
>>>> The idea is to introduce boolean properties, set on creation of the view
>>>> and the storage table, that indicate that they belong to a materialized
>>>> view. The view property "materialized" is set to "true" for an MV and
>>>> "false" for a regular view. The table property "storage_table" is set to
>>>> "true" for a storage table and "false" for a regular table. The absence
>>>> of these properties indicates a regular view or table.
>>>>
>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>
>>>> // REST: GET /namespaces/db1/views/mv1
>>>> // non-REST: load JSON file at metadata_location
>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>
>>>> // REST: GET /namespaces/db1/tables/mv1
>>>> // non-REST: load JSON file at table_metadata_location if present
>>>> Table storageTable = mv.storageTable();
>>>>
>>>> // REST: POST /namespaces/db1/tables/mv1
>>>> // non-REST: update JSON file at table_metadata_location
>>>> storageTable.newAppend().appendFile(...).commit();
>>>>
>>>> We could then introduce a new requirement for views and tables called
>>>> "AssertProperty", which could make sure to only perform updates that are
>>>> in line with materialized views. The additional requirement can be seen as
>>>> a general extension which does not need to be changed if we decide to go
>>>> with Option 1 in the future.
>>>>
>>>> Let me know what you think.
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>
>>>> Thanks Ryan for the insights. I agree that reusing existing metadata
>>>> definitions and minimizing spec changes are very important. This also
>>>> minimizes spec drift (between materialized views and views spec, and
>>>> between materialized views and tables spec), and simplifies the
>>>> implementation.
>>>>
>>>> In an effort to take the discussion forward with concrete design
>>>> options based on an end-to-end implementation, I have prototyped the
>>>> implementation (and added Spark support) in this PR
>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us reach
>>>> convergence faster. More details about some of the design options are
>>>> discussed in the description of the PR.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> I mean separate table and view metadata that is somehow combined
>>>>> through a commit process. For instance, keeping a pointer to a table
>>>>> metadata file in a view metadata file or combining commits to reference
>>>>> both. I don't see the value in either option.
>>>>>
>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <[email protected]> wrote:
>>>>>
>>>>>> Thanks Ryan for the help to trace back to the root question! Just a
>>>>>> clarification question regarding your reply before I reply further: what
>>>>>> exactly does the option "a combination of the two (i.e. commits are
>>>>>> combined)" mean? How is that different from "a new metadata type"?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> I’m catching up on this conversation, so hopefully I can bring a
>>>>>>> fresh perspective.
>>>>>>>
>>>>>>> Jack already pointed out that we need to start from the basics and I
>>>>>>> agree with that. Let’s remove voting at this point. Right now is the
>>>>>>> time for discussing trade-offs, not lining up and taking sides. I
>>>>>>> realize that wasn’t the intent with adding a vote, but that’s almost
>>>>>>> always the result. It’s too easy to use it as a stand-in for consensus
>>>>>>> and move on prematurely. I get the impression from the swirl in Slack
>>>>>>> that discussion has moved ahead of agreement.
>>>>>>>
>>>>>>> We’re still at the most basic question: is a materialized view a
>>>>>>> view and a separate table, a combination of the two (i.e. commits are
>>>>>>> combined), or a new metadata type?
>>>>>>>
>>>>>>> For now, I’m ignoring whether the “separate table” is some kind of
>>>>>>> “system table” (meaning hidden?) or if it is exposed in the catalog.
>>>>>>> That’s a later choice (already pointed out) and, I suspect, it should be
>>>>>>> delegated to catalog implementations.
>>>>>>>
>>>>>>> To simplify this a little, I think that we can eliminate the option
>>>>>>> to combine table and view commits. I don’t think there is a reason to
>>>>>>> combine the two. If separate, a table would track the view version used
>>>>>>> along with freshness information for referenced tables. If the table is
>>>>>>> automatically skipped when the version no longer matches the view, then
>>>>>>> no action needs to happen when a view definition changes. Similarly, the
>>>>>>> table can be updated independently without needing to also swap view
>>>>>>> metadata. This also aligns with the idea from the original doc that
>>>>>>> there can be multiple materialization tables for a view. Each should
>>>>>>> operate independently unless I’m missing something.
>>>>>>>
>>>>>>> I don’t think the last paragraph’s conclusion is contentious so I’ll
>>>>>>> move on, but please stop here and reply if you disagree!
>>>>>>>
>>>>>>> That leaves the two main options: a view and a separate table linked
>>>>>>> by metadata, or combined materialized view metadata.
>>>>>>>
>>>>>>> As the doc notes, the separate view and table option is simpler
>>>>>>> because it reuses existing metadata definitions and falls back to simple
>>>>>>> views. That is a significantly smaller spec and small is very, very
>>>>>>> important when it comes to specs. I think that the argument for a new
>>>>>>> definition of a materialized view needs to overcome this disadvantage.
>>>>>>>
>>>>>>> The arguments that I see for a combined materialized view object are:
>>>>>>>
>>>>>>>    - Regular views are separate, rather than being tables with SQL
>>>>>>>    and no data so it would be inconsistent (“Iceberg view is just a
>>>>>>>    table with no data but with representations defined. But we did not
>>>>>>>    do that.”)
>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>    materialized views
>>>>>>>    - Tables are not typically exposed to end users — but this isn’t
>>>>>>>    required by the separate view and table option
>>>>>>>
>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>
>>>>>>> Ryan
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>>
