Ryan, in the option "Separate table and view", will there be a reference (or pointer) to the table from the view metadata? Since the option of "embedding a table metadata location in view metadata" is not preferred, it is not clear how to associate the table with the view in the "Separate table and view" option without such a pointer.
Thanks,
Walaa.

On Thu, Feb 29, 2024 at 3:04 PM Ryan Blue <b...@tabular.io> wrote:

> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll be
> more specific:
>
>    - *Separate table and view*: this option is to have the objects that
>    we have today, with extra metadata. Commit processes are separate:
>    committing to the table doesn’t alter the view and committing to the view
>    doesn’t change the table. However, changing the view can make it so the
>    table is no longer useful as a materialization.
>    - *A combination of a view and a table*: in this option, the table
>    metadata and view metadata are the same as the first option. The difference
>    is that the commit process combines them, either by embedding a table
>    metadata location in view metadata or by tracking both in the same catalog
>    reference.
>    - *A new metadata type*: this option is where we define a new metadata
>    object that has view attributes, like SQL representations, along with table
>    attributes, like partition specs and snapshots.
>
> Hopefully this is clear because I think much of the confusion is caused by
> different definitions.
>
> The LoadTableResponse having optional metadata-location field implies that
> the object in the catalog no longer needs to hold a metadata file pointer
>
> The REST protocol has not removed the requirement for a metadata file, so
> I’m going to keep focused on the MV design options.
>
> When we say a MV can be a “new metadata type”, it does not mean it needs
> to define a completely brand new structure of the metadata content
>
> I’m making a distinction between separate metadata files for the table and
> the view and a combined metadata object, as above.
>
> We can define an “Iceberg MV” to be an object in a catalog, which has 1
> table metadata file pointer, and 1 view metadata file pointer
>
> This is the option I am referring to as a “combination of a view and a
> table”.
> > So to review my initial email, I don’t see a reason why a combined view > and table is advantageous, either implemented by having a catalog reference > with two metadata locations or embedding a table metadata location in view > metadata. This would cause unnecessary dependence between the view and > table in catalogs. I guess there’s an argument that you could load both > table and view metadata locations at the same time. That hardly seems worth > the trouble given the recent issues with adding views to the JDBC catalog. > > I also think that once we decide on structure, we can make it possible for > REST catalog implementations to do smart things, in a way that doesn’t put > additional requirements on the underlying catalog store. For instance, we > could specify how to send additional objects in a LoadViewResult, in case > the catalog wants to pre-fetch table metadata. I think these optimizations > are a later addition, after we define the relationship between views and > tables. > > Jack, it sounds like you’re the proponent of a combined table and view > (rather than a new metadata spec for a materialized view). What is the main > motivation? It seems like you’re convinced of that approach, but I don’t > understand the advantage it brings. > > Ryan > > On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com> > wrote: > >> Hi >> >> Yes I mostly agree with the assessment. To clarify a few minor points. >> >> is a materialized view a view and a separate table, a combination of the >>> two (i.e. commits are combined), or a new metadata type? >> >> >> For 'new metadata type', I consider mostly Jack's initial proposal of a >> new Catalog MV object that has two references (ViewMetadata + >> TableMetadata). 
>> >> The arguments that I see for a combined materialized view object are: >>> >>> - Regular views are separate, rather than being tables with SQL and >>> no data so it would be inconsistent (“Iceberg view is just a table with >>> no >>> data but with representations defined. But we did not do that.”) >>> >>> >>> - Materialized views are different objects in DDL >>> >>> >>> - Tables may be a superset of functionality needed for materialized >>> views >>> >>> >>> - Tables are not typically exposed to end users — but this isn’t >>> required by the separate view and table option >>> >>> For completeness, there seem to be a few additional ones (mentioned in >> the Slack and above messages). >> >> - Lack of spec change (to ViewMetadata). But as Jack says it is a >> spec change (ie, to catalogs) >> - A single call to get the View's StorageTable (versus two calls) >> - A more natural API, no opportunity for user to call >> Catalog.dropTable() and renameTable() on storage table >> >> >> *Thoughts: *I think the long discussion sessions we had on Slack >> was fruitful for me, as seeing the API clarified some things. >> >> I was initially more in favor of MV being a new metadata type >> (TableMetadata + ViewMetadata). But seeing most of the MV operations end >> up being ViewCatalog or Catalog operations, I am starting to think API-wise >> that it may not align with the new metadata type (unless we define >> MVCatalog and /MV REST endpoints, which then are boilerplate wrappers). >> >> Initially one question I had for option 'a view and a separate table', >> was how to make this table reference (metadata.json or catalog reference). >> In the previous option, we had a precedent of Catalog references to >> Metadata, but not pointers between Metadatas. I initially saw the proposed >> Catalog's TableIdentifier pointer as 'polluting' catalog concerns in >> ViewMetadata. (I saw Catalog and ViewCatalog as a layer above >> TableMetadata and ViewMetadata). 
But I think Dan in the Slack made a fair >> point that ViewMetadata already is tightly bound with a Catalog. In this >> case, I think this approach does have its merits as well in aligning >> Catalog API's with the metadata. >> >> Thanks >> Szehon >> >> >> >> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <jank...@mailbox.org.invalid> >> wrote: >> >>> Hi all, >>> >>> I would like to provide my perspective on the question of what a >>> materialized view is and elaborate on Jack's recent proposal to view a >>> materialized view as a catalog concept. >>> >>> Firstly, let's look at the role of the catalog. Every entity in the >>> catalog has a *unique identifier*, and the catalog provides methods to >>> create, load, and update these entities. An important thing to note is that >>> the catalog methods exhibit two different behaviors: the *create and >>> load methods deal with the entire entity*, while the *update(commit) >>> method only deals with partial changes* to the entities. >>> >>> In the context of our current discussion, materialized view (MV) >>> metadata is a union of view and table metadata. The fact that the update >>> method deals only with partial changes, enables us to *reuse the >>> existing methods for updating tables and views*. For updates we don't >>> have to define what constitutes an entire materialized view. Changes to a >>> materialized view targeting the properties related to the view metadata >>> could use the update(commit) view method. Similarly, changes targeting the >>> properties related to the table metadata could use the update(commit) table >>> method. This is great news because we don't have to redefine view and table >>> commits (requirements, updates). 
>>> This is shown in the fact that Jack uses the same operation to update
>>> the storage table for Options 1 and 3:
>>>
>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>> // non-REST: update JSON files at table_metadata_location
>>> storageTable.newAppend().appendFile(...).commit();
>>>
>>> The open question is *whether the create and load methods should treat
>>> the properties that constitute the MV metadata as two entities (View +
>>> Table) or one entity (new MV object)*. This is all part of Jack's
>>> proposal, where Option 1 proposes a new MV object, and Option 3 proposes
>>> two separate entities. The advantage of Option 1 is that it doesn't require
>>> two operations to load the metadata. On the other hand, the advantage of
>>> Option 3 is that no new operations or catalogs have to be defined.
>>>
>>> In my opinion, defining a new representation for materialized views
>>> (Option 1) is generally the cleaner solution. However, I see a path where
>>> we could first introduce Option 3 and still have the possibility to
>>> transition to Option 1 if needed. The great thing about Option 3 is that it
>>> only requires minor changes to the current spec and is mostly an
>>> implementation detail.
>>>
>>> Therefore I would propose small additions to Jack's Option 3 that only
>>> introduce changes to the spec that are not specific to materialized views.
>>> The idea is to introduce boolean properties to be set on the creation of
>>> the view and the storage table that indicate that they belong to a
>>> materialized view. The view property "materialized" is set to "true" for a
>>> MV and "false" for a regular view. And the table property "storage_table"
>>> is set to "true" for a storage table and "false" for a regular table. The
>>> absence of these properties indicates a regular view or table.
>>>
>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>
>>> // REST: GET /namespaces/db1/views/mv1
>>> // non-REST: load JSON file at metadata_location
>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>
>>> // REST: GET /namespaces/db1/tables/mv1
>>> // non-REST: load JSON file at table_metadata_location if present
>>> Table storageTable = mv.storageTable();
>>>
>>> // REST: POST /namespaces/db1/tables/mv1
>>> // non-REST: update JSON file at table_metadata_location
>>> storageTable.newAppend().appendFile(...).commit();
>>>
>>> We could then introduce a new requirement for views and tables called
>>> "AssertProperty" which could make sure to only perform updates that are
>>> in line with materialized views. The additional requirement can be seen as a
>>> general extension which does not need to be changed if we decide to go
>>> with Option 1 in the future.
>>>
>>> Let me know what you think.
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>
>>> Thanks Ryan for the insights. I agree that reusing existing metadata
>>> definitions and minimizing spec changes are very important. This also
>>> minimizes spec drift (between materialized views and views spec, and
>>> between materialized views and tables spec), and simplifies the
>>> implementation.
>>>
>>> In an effort to take the discussion forward with concrete design options
>>> based on an end-to-end implementation, I have prototyped the
>>> implementation (and added Spark support) in this PR
>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us reach
>>> convergence faster. More details about some of the design options are
>>> discussed in the description of the PR.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I mean separate table and view metadata that is somehow combined
>>>> through a commit process.
For instance, keeping a pointer to a table >>>> metadata file in a view metadata file or combining commits to reference >>>> both. I don't see the value in either option. >>>> >>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote: >>>> >>>>> Thanks Ryan for the help to trace back to the root question! Just a >>>>> clarification question regarding your reply before I reply further: what >>>>> exactly does the option "a combination of the two (i.e. commits are >>>>> combined)" mean? How is that different from "a new metadata type"? >>>>> >>>>> -Jack >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote: >>>>> >>>>>> I’m catching up on this conversation, so hopefully I can bring a >>>>>> fresh perspective. >>>>>> >>>>>> Jack already pointed out that we need to start from the basics and I >>>>>> agree with that. Let’s remove voting at this point. Right now is the time >>>>>> for discussing trade-offs, not lining up and taking sides. I realize that >>>>>> wasn’t the intent with adding a vote, but that’s almost always the >>>>>> result. >>>>>> It’s too easy to use it as a stand-in for consensus and move on >>>>>> prematurely. I get the impression from the swirl in Slack that discussion >>>>>> has moved ahead of agreement. >>>>>> >>>>>> We’re still at the most basic question: is a materialized view a view >>>>>> and a separate table, a combination of the two (i.e. commits are >>>>>> combined), >>>>>> or a new metadata type? >>>>>> >>>>>> For now, I’m ignoring whether the “separate table” is some kind of >>>>>> “system table” (meaning hidden?) or if it is exposed in the catalog. >>>>>> That’s >>>>>> a later choice (already pointed out) and, I suspect, it should be >>>>>> delegated >>>>>> to catalog implementations. >>>>>> >>>>>> To simplify this a little, I think that we can eliminate the option >>>>>> to combine table and view commits. I don’t think there is a reason to >>>>>> combine the two. 
If separate, a table would track the view version used
>>>>>> along with freshness information for referenced tables. If the table is
>>>>>> automatically skipped when the version no longer matches the view, then
>>>>>> no action needs to happen when a view definition changes. Similarly, the
>>>>>> table can be updated independently without needing to also swap view
>>>>>> metadata. This also aligns with the idea from the original doc that there
>>>>>> can be multiple materialization tables for a view. Each should operate
>>>>>> independently unless I’m missing something.
>>>>>>
>>>>>> I don’t think the last paragraph’s conclusion is contentious so I’ll
>>>>>> move on, but please stop here and reply if you disagree!
>>>>>>
>>>>>> That leaves the main two options: a view and a separate table linked
>>>>>> by metadata, or combined materialized view metadata.
>>>>>>
>>>>>> As the doc notes, the separate view and table option is simpler
>>>>>> because it reuses existing metadata definitions and falls back to simple
>>>>>> views. That is a significantly smaller spec and small is very, very
>>>>>> important when it comes to specs. I think that the argument for a new
>>>>>> definition of a materialized view needs to overcome this disadvantage.
>>>>>>
>>>>>> The arguments that I see for a combined materialized view object are:
>>>>>>
>>>>>>    - Regular views are separate, rather than being tables with SQL
>>>>>>    and no data so it would be inconsistent (“Iceberg view is just a
>>>>>>    table with no data but with representations defined. But we did not
>>>>>>    do that.”)
>>>>>>    - Materialized views are different objects in DDL
>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>    materialized views
>>>>>>    - Tables are not typically exposed to end users, but this isn’t
>>>>>>    required by the separate view and table option
>>>>>>
>>>>>> Am I missing any arguments for combined metadata?
>>>>>> >>>>>> Ryan >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>> >>>> >>>> -- >>>> Ryan Blue >>>> Tabular >>>> >>> > > -- > Ryan Blue > Tabular >
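
For readers following the thread, two mechanisms discussed for the "separate table and view" option can be sketched in plain Java: Jan's boolean marker properties ("materialized" on the view, "storage_table" on the table, with absence meaning a regular object), and Ryan's rule that a materialization table is skipped once the view version it was built from no longer matches the view. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not part of the Iceberg API, and comparing recorded source snapshot IDs is just one possible reading of "freshness information for referenced tables".

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Iceberg API.
final class MaterializedViewSketch {

  // Property keys as proposed in Jan's message.
  static final String MATERIALIZED = "materialized";
  static final String STORAGE_TABLE = "storage_table";

  private MaterializedViewSketch() {}

  // A view counts as a materialized view only if "materialized" is "true".
  // Boolean.parseBoolean(null) is false, so an absent property means a regular view.
  static boolean isMaterializedView(Map<String, String> viewProperties) {
    return Boolean.parseBoolean(viewProperties.get(MATERIALIZED));
  }

  // A table counts as a storage table only if "storage_table" is "true".
  static boolean isStorageTable(Map<String, String> tableProperties) {
    return Boolean.parseBoolean(tableProperties.get(STORAGE_TABLE));
  }

  // The materialization is usable only if it was built from the current view
  // version and (assumption) every source table is still at the snapshot that
  // was recorded when the storage table was last refreshed.
  static boolean canUseMaterialization(
      int currentViewVersionId,
      int materializedViewVersionId,
      Map<String, Long> currentSourceSnapshotIds,
      Map<String, Long> recordedSourceSnapshotIds) {
    return currentViewVersionId == materializedViewVersionId
        && currentSourceSnapshotIds.equals(recordedSourceSnapshotIds);
  }
}
```

With this shape, changing the view definition requires no action on the storage table: the version comparison fails, and the engine would simply fall back to evaluating the view as a regular view.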