Hi Yes I mostly agree with the assessment. To clarify a few minor points.
is a materialized view a view and a separate table, a combination of the > two (i.e. commits are combined), or a new metadata type? For 'new metadata type', I consider mostly Jack's initial proposal of a new Catalog MV object that has two references (ViewMetadata + TableMetadata). The arguments that I see for a combined materialized view object are: > > - Regular views are separate, rather than being tables with SQL and no > data so it would be inconsistent (“Iceberg view is just a table with no > data but with representations defined. But we did not do that.”) > > > - Materialized views are different objects in DDL > > > - Tables may be a superset of functionality needed for materialized > views > > > - Tables are not typically exposed to end users — but this isn’t > required by the separate view and table option > > For completeness, there seem to be a few additional ones (mentioned in the Slack and above messages). - Lack of spec change (to ViewMetadata). But as Jack says it is a spec change (ie, to catalogs) - A single call to get the View's StorageTable (versus two calls) - A more natural API, no opportunity for user to call Catalog.dropTable() and renameTable() on storage table *Thoughts: *I think the long discussion sessions we had on Slack was fruitful for me, as seeing the API clarified some things. I was initially more in favor of MV being a new metadata type (TableMetadata + ViewMetadata). But seeing most of the MV operations end up being ViewCatalog or Catalog operations, I am starting to think API-wise that it may not align with the new metadata type (unless we define MVCatalog and /MV REST endpoints, which then are boilerplate wrappers). Initially one question I had for option 'a view and a separate table', was how to make this table reference (metadata.json or catalog reference). In the previous option, we had a precedent of Catalog references to Metadata, but not pointers between Metadatas. I initially saw the proposed Catalog's TableIdentifier pointer as 'polluting' catalog concerns in ViewMetadata. (I saw Catalog and ViewCatalog as a layer above TableMetadata and ViewMetadata). But I think Dan in the Slack made a fair point that ViewMetadata already is tightly bound with a Catalog. In this case, I think this approach does have its merits as well in aligning Catalog API's with the metadata. Thanks Szehon On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <jank...@mailbox.org.invalid> wrote: > Hi all, > > I would like to provide my perspective on the question of what a > materialized view is and elaborate on Jack's recent proposal to view a > materialized view as a catalog concept. > > Firstly, let's look at the role of the catalog. Every entity in the > catalog has a *unique identifier*, and the catalog provides methods to > create, load, and update these entities. An important thing to note is that > the catalog methods exhibit two different behaviors: the *create and load > methods deal with the entire entity*, while the *update(commit) method > only deals with partial changes* to the entities. > > In the context of our current discussion, materialized view (MV) metadata > is a union of view and table metadata. The fact that the update method > deals only with partial changes, enables us to *reuse the existing > methods for updating tables and views*. For updates we don't have to > define what constitutes an entire materialized view. Changes to a > materialized view targeting the properties related to the view metadata > could use the update(commit) view method. Similarly, changes targeting the > properties related to the table metadata could use the update(commit) table > method. This is great news because we don't have to redefine view and table > commits (requirements, updates). > This is shown in the fact that Jack uses the same operation to update the > storage table for Option 1 and 3: > > // REST: POST /namespaces/db1/tables/mv1?materializedView=true > // non-REST: update JSON files at table_metadata_location > storageTable.newAppend().appendFile(...).commit(); > > The open question is *whether the create and load methods should treat > the properties that constitute the MV metadata as two entities (View + > Table) or one entity (new MV object)*. This is all part of Jack's > proposal, where Option 1 proposes a new MV object, and Option 3 proposes > two separate entities. The advantage of Option 1 is that it doesn't require > two operations to load the metadata. On the other hand, the advantage of > Option 3 is that no new operations or catalogs have to be defined. > > In my opinion, defining a new representation for materialized views > (Option 1) is generally the cleaner solution. However, I see a path where > we could first introduce Option 3 and still have the possibility to > transition to Option 1 if needed. The great thing about Option 3 is that it > only requires minor changes to the current spec and is mostly > implementation detail. > > Therefore I would propose small additions to Jacks Option 3 that only > introduce changes to the spec that are not specific to materialized views. > The idea is to introduce boolean properties to be set on the creation of > the view and the storage table that indicate that they belong to a > materialized view. The view property "materialized" is set to "true" for a > MV and "false" for a regular view. And the table property "storage_table" > is set to "true" for a storage table and "false" for a regular table. The > absence of these properties indicates a regular view or table. > > ViewCatalog viewCatalog = (ViewCatalog) catalog; > > // REST: GET /namespaces/db1/views/mv1 > // non-REST: load JSON file at metadata_location > View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1")); > > // REST: GET /namespaces/db1/tables/mv1 > // non-REST: load JSON file at table_metadata_location if present > Table storageTable = view.storageTable(); > > // REST: POST /namespaces/db1/tables/mv1 > // non-REST: update JSON file at table_metadata_location > storageTable.newAppend().appendFile(...).commit(); > > We could then introduce a new requirement for views and tables called > "AssertProperty" which could make sure to only perform updates that are > inline with materialized views. The additional requirement can be seen as a > general extension which does not need to be changed if we decide to got > with Option 1 in the future. > > Let me know what you think. > > Best wishes, > > Jan > > On 29.02.24 04:09, Walaa Eldin Moustafa wrote: > > Thanks Ryan for the insights. I agree that reusing existing metadata > definitions and minimizing spec changes are very important. This also > minimizes spec drift (between materialized views and views spec, and > between materialized views and tables spec), and simplifies the > implementation. > > In an effort to take the discussion forward with concrete design options > based on an end-to-end implementation, I have prototyped the > implementation (and added Spark support) in this PR > https://github.com/apache/iceberg/pull/9830. I hope it helps us reach > convergence faster. More details about some of the design options are > discussed in the description of the PR. > > Thanks, > Walaa. > > > On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote: > >> I mean separate table and view metadata that is somehow combined through >> a commit process. For instance, keeping a pointer to a table metadata file >> in a view metadata file or combining commits to reference both. I don't see >> the value in either option. >> >> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote: >> >>> Thanks Ryan for the help to trace back to the root question! Just a >>> clarification question regarding your reply before I reply further: what >>> exactly does the option "a combination of the two (i.e. commits are >>> combined)" mean? How is that different from "a new metadata type"? >>> >>> -Jack >>> >>> >>> >>> >>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote: >>> >>>> I’m catching up on this conversation, so hopefully I can bring a fresh >>>> perspective. >>>> >>>> Jack already pointed out that we need to start from the basics and I >>>> agree with that. Let’s remove voting at this point. Right now is the time >>>> for discussing trade-offs, not lining up and taking sides. I realize that >>>> wasn’t the intent with adding a vote, but that’s almost always the result. >>>> It’s too easy to use it as a stand-in for consensus and move on >>>> prematurely. I get the impression from the swirl in Slack that discussion >>>> has moved ahead of agreement. >>>> >>>> We’re still at the most basic question: is a materialized view a view >>>> and a separate table, a combination of the two (i.e. commits are combined), >>>> or a new metadata type? >>>> >>>> For now, I’m ignoring whether the “separate table” is some kind of >>>> “system table” (meaning hidden?) or if it is exposed in the catalog. That’s >>>> a later choice (already pointed out) and, I suspect, it should be delegated >>>> to catalog implementations. >>>> >>>> To simplify this a little, I think that we can eliminate the option to >>>> combine table and view commits. I don’t think there is a reason to combine >>>> the two. If separate, a table would track the view version used along with >>>> freshness information for referenced tables. If the table is automatically >>>> skipped when the version no longer matches the view, then no action needs >>>> to happen when a view definition changes. Similarly, the table can be >>>> updated independently without needing to also swap view metadata. This also >>>> aligns with the idea from the original doc that there can be multiple >>>> materialization tables for a view. Each should operate independently unless >>>> I’m missing something >>>> >>>> I don’t think the last paragraph’s conclusion is contentious so I’ll >>>> move on, but please stop here and reply if you disagree! >>>> >>>> That leaves the main two options, a view and a separate table linked by >>>> metadata, or, combined materialized view metadata. >>>> >>>> As the doc notes, the separate view and table option is simpler because >>>> it reuses existing metadata definitions and falls back to simple views. >>>> That is a significantly smaller spec and small is very, very important when >>>> it comes to specs. I think that the argument for a new definition of a >>>> materialized view needs to overcome this disadvantage. >>>> >>>> The arguments that I see for a combined materialized view object are: >>>> >>>> - Regular views are separate, rather than being tables with SQL and >>>> no data so it would be inconsistent (“Iceberg view is just a table with >>>> no >>>> data but with representations defined. But we did not do that.”) >>>> - Materialized views are different objects in DDL >>>> - Tables may be a superset of functionality needed for materialized >>>> views >>>> - Tables are not typically exposed to end users — but this isn’t >>>> required by the separate view and table option >>>> >>>> Am I missing any arguments for combined metadata? >>>> >>>> Ryan >>>> -- >>>> Ryan Blue >>>> Tabular >>>> >>> >> >> -- >> Ryan Blue >> Tabular >> >