Hey everyone, I think this thread has hit a point of diminishing returns and that we still don't have a common understanding of what the options under consideration actually are.
Since we were already planning on discussing this at the next community sync, I suggest we pick this up there and use that time to align on what exactly we're considering. We can then start a new thread to lay out the designs under consideration in more detail and then have a discussion about trade-offs. Does that sound reasonable? Ryan On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > I am finding it hard to interpret the options concretely. I would also > suggest breaking the expectations/outcomes into milestones. Maybe it becomes > easier if we agree to distinguish between an approach that is feasible in > the near term and another in the long term, especially if the latter > requires significant engine-side changes. > > Further, maybe it helps if we start with an option that fully reuses the > existing spec, and see how we view it in comparison with the options > discussed previously. I am sharing one below. It reuses the current spec of > Iceberg views and tables by leveraging table properties to capture > materialized view metadata. What is common (and not common) between this > and the desired representations? > > The new properties are: > Properties on a View: > > 1. > > *iceberg.materialized.view*: > - *Type*: View property > - *Purpose*: This property is used to mark whether a view is a > materialized view. If set to true, the view is treated as a > materialized view. This helps in differentiating between virtual and > materialized views within the catalog and dictates specific handling and > validation logic for materialized views. > 2. > > *iceberg.materialized.view.storage.location*: > - *Type*: View property > - *Purpose*: Specifies the location of the storage table associated > with the materialized view. This property is used for linking a > materialized view with its corresponding storage table, enabling data > management and query execution based on the stored data's freshness. > > Properties on a Table: > > 1. *base.snapshot.[UUID]*: > - *Type*: Table property > - *Purpose*: These properties store the snapshot IDs of the base > tables at the time the materialized view's data was last updated. Each > property is prefixed with base.snapshot. followed by the UUID of > the base table. They are used to track whether the materialized view's > data > is up to date with the base tables by comparing these snapshot IDs with > the > current snapshot IDs of the base tables. If all the base tables' current > snapshot IDs match the ones stored in these properties, the materialized > view's data is considered fresh. > > > Thanks, > Walaa. > > > On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote: > >> > All of these approaches are aligned in one specific way: the storage >> table is an Iceberg table. >> >> I do not think that is true. I think people are aligned that we would >> like to re-use the Iceberg table metadata defined in the Iceberg table spec >> to express the data in MV, but I don't think it goes so far as to say it >> must be an Iceberg table. Once you have that mindset, then of course option >> 1 (separate table and view) is the only option. >> >> > I don't think that is necessary and it significantly increases the >> complexity. >> >> And can you quantify what you mean by "significantly increases the >> complexity"? Seems like a lot of concerns are coming from the tradeoff with >> complexity. 
We probably all agree that using option 7 (a completely new >> metadata type) is a lot of work from scratch, which is why it is not >> favored. However, my understanding is that as long as we re-use the view >> and table metadata, then the majority of the existing logic can be reused. >> I think what we have gone through in Slack to draft the rough Java API >> shape helps here, because people can estimate the amount of effort required >> to implement it. And I don't think they are **significantly** more complex >> to implement. Could you elaborate more on the complexity that you >> have in mind? >> >> -Jack >> >> >> >> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com> >> wrote: >> >>> I feel I've been most vocal about pushing back against options 2+ (or >>> Ryan's categories of combined table/view, or new metadata type), so I'll >>> try to expand on my reasoning. >>> >>> I understand the appeal of creating a design where we encapsulate the >>> view/storage from both a structural and performance standpoint, but I don't >>> think that is necessary and it significantly increases the complexity. >>> >>> All of these approaches are aligned in one specific way: the storage >>> table is an Iceberg table. >>> >>> Because of this, all the behaviors and requirements still apply to these >>> tables. They need to be maintained (snapshot cleanup, orphan files), in >>> some cases need to be optimized (compaction, manifest rewrites), and they need to be >>> inspectable (this will be even more important with MV since >>> staleness can produce different results and questions will arise about what >>> state the storage table was in). There may be cases where the tables need >>> to be managed directly. >>> >>> Anywhere we deviate from the existing constructs/commit/access for >>> tables, we will ultimately have to unwrap to re-expose the underlying >>> Iceberg behavior. This creates unnecessary complexity in the library/API >>> layer, which is not the primary interface users will have with >>> materialized views, since an engine is almost always necessary to interact >>> with the dataset. >>> >>> As to the performance concerns around option 1, I think we're >>> overstating the downsides. It really comes down to how many metadata loads >>> are necessary, and evaluating freshness would likely be the real bottleneck, >>> as it involves potentially loading many tables. All of the options are on >>> the same order of performance for the metadata and table loads. >>> >>> As to the visibility of tables and whether they're registered in the >>> catalog, I think registering in the catalog is the right approach so that >>> the tables are still addressable for maintenance, etc. The visibility of >>> the storage table is a catalog implementation decision and shouldn't be a >>> requirement of the MV spec (I can see cases for both and it isn't necessary >>> to dictate a behavior). >>> >>> I'm still strongly in favor of Option 1 (separate table and view) for >>> these reasons. >>> >>> -Dan >>> >>> >>> >>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote: >>> >>>> > Jack, it sounds like you’re the proponent of a combined table and >>>> view (rather than a new metadata spec for a materialized view). What is the >>>> main motivation? It seems like you’re convinced of that approach, but I >>>> don’t understand the advantage it brings. 
>>>> >>>> Sorry, I had to make a Google Sheet to capture all the options we have >>>> discussed so far. I wanted to use the existing Google Doc, but it has >>>> really bad table/sheet support... >>>> >>>> >>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>> >>>> I have listed all the options, with how they are implemented and some >>>> important considerations we have discussed so far. Note that: >>>> 1. This sheet currently excludes the lineage information, which we can >>>> discuss more later after the current topic is resolved. >>>> 2. I removed the considerations for REST integration since, in the >>>> other thread, we have clarified that they should be considered completely >>>> separately. >>>> >>>> *Why I have come to favor a new MV object with table and >>>> view metadata file pointers* >>>> >>>> In my sheet, there are 3 options that do not have major problems: >>>> Option 2: Add storage table metadata file pointer in view object >>>> Option 5: New MV object with table and view metadata file pointer >>>> Option 6: New MV spec with table and view metadata >>>> >>>> I originally excluded option 2 because I think it does not align with >>>> the REST spec, but after the other discussion thread about "Inconsistency >>>> between REST spec and table/view spec", I think my original concern no >>>> longer holds true, so I have now put it back. And based on my personal >>>> preference that MV is an independent object that should be separated from >>>> view and table, plus the fact that option 5 is probably less work than >>>> option 6 to implement, that is why I am a proponent of option 5 >>>> at this moment. >>>> >>>> >>>> *Regarding Ryan's evaluation framework* >>>> >>>> I think we need to reconcile this sheet with Ryan's evaluation >>>> framework. That framework categorization puts options 2, 3, 4, 5, and 6 all >>>> under the same category of "A combination of a view and a table" and >>>> concludes that they don't have any advantage for the same set of reasons. >>>> But those reasons are not really convincing to me, so let's talk about them >>>> in more detail. >>>> >>>> (1) You said "I don’t see a reason why a combined view and table is >>>> advantageous" because "this would cause unnecessary dependence between the view >>>> and table in catalogs." What dependency exactly do you mean here? And why >>>> is that unnecessary, given there has to be some sort of dependency anyway >>>> unless we go with option 5 or 6? >>>> >>>> (2) You said "I guess there’s an argument that you could load both >>>> table and view metadata locations at the same time. That hardly seems worth >>>> the trouble". I disagree with that. Catalog interaction performance is >>>> critical to at least everyone working in EMR and Athena, and MV itself as >>>> an acceleration approach needs to be as fast as possible. >>>> >>>> I have put 3 key operations in the doc that I think matter for MV >>>> during interactions with an engine: >>>> 1. refresh the storage table >>>> 2. get the storage table of the MV >>>> 3. if stale, get the view SQL >>>> >>>> And option 1 clearly falls short, with 4 sequential steps required to >>>> load a storage table (see the rough sketch below). You mentioned "recent issues with adding views to the >>>> JDBC catalog" on this topic; could you explain a bit more? 
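>>>> For concreteness, here is a rough sketch of those 4 sequential steps against the existing Java API, assuming viewCatalog and catalog are already initialized and that the view metadata records the storage table under a hypothetical "storage.table" property:
>>>>
>>>> // steps 1-2: catalog lookup for the view, then fetch and parse its metadata file
>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>
>>>> // resolve the storage table identifier tracked by the view
>>>> // ("storage.table" is a hypothetical property name, just for illustration)
>>>> TableIdentifier storageId = TableIdentifier.parse(mv.properties().get("storage.table"));
>>>>
>>>> // steps 3-4: catalog lookup for the storage table, then fetch and parse its metadata file
>>>> Table storageTable = catalog.loadTable(storageId);
>>>>
>>>> Each step has to finish before the next one can start, which is why the round trips add up.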
>>>> >>>> (3) You said "I also think that once we decide on structure, we can >>>> make it possible for REST catalog implementations to do smart things, in a >>>> way that doesn’t put additional requirements on the underlying catalog >>>> store." If REST is fully compatible with the Iceberg spec, then I have no >>>> problem with this statement. However, as we discussed in the other thread, >>>> that is not the case. In the current state, I think the sequence of actions >>>> should be to evolve the Iceberg table/view spec (or add an MV spec) first, >>>> and then think about how REST can incorporate it or do smart things that >>>> are not Iceberg spec compliant. Do you agree with that? >>>> >>>> (4) You said the table identifier pointer "is a problem we need to >>>> solve generally because a materialized table needs to be able to track the >>>> upstream state of tables that were used". I don't think that is a reason to >>>> choose to use a table identifier pointer for a storage table. The issue is >>>> not about using a table identifier pointer. It is about exposing the >>>> storage table as a separate entity in the catalog, which is what people do >>>> not like and is already discussed at length in Jan's question 3 (also >>>> linked in the sheet). I agree with that concern, because without a REST >>>> implementation that can magically hide the storage table, this model adds >>>> an additional compliance and data governance burden for any >>>> non-REST catalog implementation that is compliant with the Iceberg spec. >>>> Many mechanisms need to be built in a catalog to hide, protect, maintain, >>>> and recycle the storage table, and these can be avoided by using other approaches. I >>>> think we should reach a consensus about that and discuss further if you do >>>> not agree. >>>> >>>> Best, >>>> Jack Ye >>>> >>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <jank...@mailbox.org.invalid> >>>> wrote: >>>> >>>>> Hi Ryan, we actually discussed your categories in this question >>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>, >>>>> where your categories correspond to the following designs: >>>>> >>>>> - Separate table and view => Design 1 >>>>> - Combination of view and table => Design 2 >>>>> - A new metadata type => Design 4 >>>>> >>>>> Jan >>>>> On 01.03.24 00:03, Ryan Blue wrote: >>>>> >>>>> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll >>>>> be more specific: >>>>> >>>>> - *Separate table and view*: this option is to have the objects >>>>> that we have today, with extra metadata. Commit processes are separate: >>>>> committing to the table doesn’t alter the view and committing to the >>>>> view >>>>> doesn’t change the table. However, changing the view can make it so the >>>>> table is no longer useful as a materialization. >>>>> - *A combination of a view and a table*: in this option, the table >>>>> metadata and view metadata are the same as in the first option. The >>>>> difference >>>>> is that the commit process combines them, either by embedding a table >>>>> metadata location in view metadata or by tracking both in the same >>>>> catalog >>>>> reference. >>>>> - *A new metadata type*: this option is where we define a new >>>>> metadata object that has view attributes, like SQL representations, >>>>> along >>>>> with table attributes, like partition specs and snapshots. >>>>> >>>>> Hopefully this is clear, because I think much of the confusion is >>>>> caused by different definitions. 
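>>>>> To make the distinction concrete, here is a purely illustrative sketch, assuming viewCatalog and catalog are in scope (the storageTable accessor, the MaterializedView type, and loadMaterializedView are hypothetical names, not existing APIs):
>>>>>
>>>>> // Separate table and view: two catalog objects, loaded and committed independently
>>>>> View view = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>> Table storage = catalog.loadTable(TableIdentifier.of("db1", "mv1_storage")); // committed separately from the view
>>>>>
>>>>> // Combination of a view and a table: the commit process ties them together, e.g. a
>>>>> // table metadata location embedded in the view metadata, so one load resolves both
>>>>> Table embedded = viewCatalog.loadView(TableIdentifier.of("db1", "mv1")).storageTable(); // hypothetical accessor
>>>>>
>>>>> // New metadata type: a single object carrying both view and table attributes
>>>>> MaterializedView mv = catalog.loadMaterializedView(TableIdentifier.of("db1", "mv1")); // hypothetical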
>>>>> >>>>> The LoadTableResponse having optional metadata-location field implies >>>>> that the object in the catalog no longer needs to hold a metadata file >>>>> pointer >>>>> >>>>> The REST protocol has not removed the requirement for a metadata file, >>>>> so I’m going to keep focused on the MV design options. >>>>> >>>>> When we say a MV can be a “new metadata type”, it does not mean it >>>>> needs to define a completely brand new structure of the metadata content >>>>> >>>>> I’m making a distinction between separate metadata files for the table >>>>> and the view and a combined metadata object, as above. >>>>> >>>>> We can define an “Iceberg MV” to be an object in a catalog, which has >>>>> 1 table metadata file pointer, and 1 view metadata file pointer >>>>> >>>>> This is the option I am referring to as a “combination of a view and a >>>>> table”. >>>>> >>>>> So to review my initial email, I don’t see a reason why a combined >>>>> view and table is advantageous, either implemented by having a catalog >>>>> reference with two metadata locations or embedding a table metadata >>>>> location in view metadata. This would cause unnecessary dependence between >>>>> the view and table in catalogs. I guess there’s an argument that you could >>>>> load both table and view metadata locations at the same time. That hardly >>>>> seems worth the trouble given the recent issues with adding views to the >>>>> JDBC catalog. >>>>> >>>>> I also think that once we decide on structure, we can make it possible >>>>> for REST catalog implementations to do smart things, in a way that doesn’t >>>>> put additional requirements on the underlying catalog store. For instance, >>>>> we could specify how to send additional objects in a LoadViewResult, in >>>>> case the catalog wants to pre-fetch table metadata. I think these >>>>> optimizations are a later addition, after we define the relationship >>>>> between views and tables. >>>>> >>>>> Jack, it sounds like you’re the proponent of a combined table and view >>>>> (rather than a new metadata spec for a materialized view). What is the >>>>> main >>>>> motivation? It seems like you’re convinced of that approach, but I don’t >>>>> understand the advantage it brings. >>>>> >>>>> Ryan >>>>> >>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi >>>>>> >>>>>> Yes I mostly agree with the assessment. To clarify a few minor >>>>>> points. >>>>>> >>>>>> is a materialized view a view and a separate table, a combination of >>>>>>> the two (i.e. commits are combined), or a new metadata type? >>>>>> >>>>>> >>>>>> For 'new metadata type', I consider mostly Jack's initial proposal of >>>>>> a new Catalog MV object that has two references (ViewMetadata + >>>>>> TableMetadata). >>>>>> >>>>>> The arguments that I see for a combined materialized view object are: >>>>>>> >>>>>>> - Regular views are separate, rather than being tables with SQL >>>>>>> and no data so it would be inconsistent (“Iceberg view is just a >>>>>>> table with >>>>>>> no data but with representations defined. 
But we did not do that.”) >>>>>>> >>>>>>> - Materialized views are different objects in DDL >>>>>>> >>>>>>> - Tables may be a superset of functionality needed for >>>>>>> materialized views >>>>>>> >>>>>>> - Tables are not typically exposed to end users — but this isn’t >>>>>>> required by the separate view and table option >>>>>>> >>>>>>> For completeness, there seem to be a few additional ones (mentioned >>>>>> in the Slack and above messages). >>>>>> >>>>>> - Lack of spec change (to ViewMetadata). But as Jack says, it is >>>>>> a spec change (i.e., to catalogs) >>>>>> - A single call to get the View's StorageTable (versus two calls) >>>>>> - A more natural API, with no opportunity for a user to call >>>>>> Catalog.dropTable() and renameTable() on the storage table >>>>>> >>>>>> *Thoughts: *I think the long discussion sessions we had on Slack >>>>>> were fruitful for me, as seeing the API clarified some things. >>>>>> >>>>>> I was initially more in favor of MV being a new metadata type >>>>>> (TableMetadata + ViewMetadata). But seeing that most of the MV operations end >>>>>> up being ViewCatalog or Catalog operations, I am starting to think, >>>>>> API-wise, >>>>>> that it may not align with the new metadata type (unless we define an >>>>>> MVCatalog and /MV REST endpoints, which would then be boilerplate wrappers). >>>>>> >>>>>> Initially, one question I had for the option 'a view and a separate >>>>>> table' was how to make this table reference (metadata.json or catalog >>>>>> reference). In the previous option, we had a precedent of Catalog >>>>>> references to Metadata, but not pointers between Metadatas. I initially >>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting' catalog >>>>>> concerns in ViewMetadata. (I saw Catalog and ViewCatalog as a layer >>>>>> above >>>>>> TableMetadata and ViewMetadata). But I think Dan in the Slack made a >>>>>> fair >>>>>> point that ViewMetadata is already tightly bound to a Catalog. In this >>>>>> case, I think this approach does have its merits as well in aligning >>>>>> Catalog APIs with the metadata. >>>>>> >>>>>> Thanks >>>>>> Szehon >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>> Hi all, >>>>>>> >>>>>>> I would like to provide my perspective on the question of what a >>>>>>> materialized view is and elaborate on Jack's recent proposal to view a >>>>>>> materialized view as a catalog concept. >>>>>>> >>>>>>> Firstly, let's look at the role of the catalog. Every entity in the >>>>>>> catalog has a *unique identifier*, and the catalog provides methods >>>>>>> to create, load, and update these entities. An important thing to note >>>>>>> is >>>>>>> that the catalog methods exhibit two different behaviors: the *create >>>>>>> and load methods deal with the entire entity*, while the *update(commit) >>>>>>> method only deals with partial changes* to the entities. >>>>>>> >>>>>>> In the context of our current discussion, materialized view (MV) >>>>>>> metadata is a union of view and table metadata. The fact that the update >>>>>>> method deals only with partial changes enables us to *reuse the >>>>>>> existing methods for updating tables and views*. For updates, we >>>>>>> don't have to define what constitutes an entire materialized view. >>>>>>> Changes >>>>>>> to a materialized view targeting the properties related to the view >>>>>>> metadata could use the update(commit) view method. 
Similarly, changes >>>>>>> targeting the properties related to the table metadata could use the >>>>>>> update(commit) table method. This is great news because we don't have to >>>>>>> redefine view and table commits (requirements, updates). >>>>>>> This is shown by the fact that Jack uses the same operation to >>>>>>> update the storage table for Options 1 and 3: >>>>>>> >>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true >>>>>>> // non-REST: update JSON files at table_metadata_location >>>>>>> storageTable.newAppend().appendFile(...).commit(); >>>>>>> >>>>>>> The open question is *whether the create and load methods should >>>>>>> treat the properties that constitute the MV metadata as two entities >>>>>>> (View >>>>>>> + Table) or one entity (new MV object)*. This is all part of Jack's >>>>>>> proposal, where Option 1 proposes a new MV object, and Option 3 proposes >>>>>>> two separate entities. The advantage of Option 1 is that it doesn't >>>>>>> require >>>>>>> two operations to load the metadata. On the other hand, the advantage of >>>>>>> Option 3 is that no new operations or catalogs have to be defined. >>>>>>> >>>>>>> In my opinion, defining a new representation for materialized views >>>>>>> (Option 1) is generally the cleaner solution. However, I see a path >>>>>>> where >>>>>>> we could first introduce Option 3 and still have the possibility to >>>>>>> transition to Option 1 if needed. The great thing about Option 3 is >>>>>>> that it >>>>>>> only requires minor changes to the current spec and is mostly an >>>>>>> implementation detail. >>>>>>> >>>>>>> Therefore, I would propose small additions to Jack's Option 3 that >>>>>>> only introduce changes to the spec that are not specific to materialized >>>>>>> views. The idea is to introduce boolean properties to be set on the >>>>>>> creation of the view and the storage table that indicate that they >>>>>>> belong >>>>>>> to a materialized view. The view property "materialized" is set to >>>>>>> "true" >>>>>>> for an MV and "false" for a regular view. And the table property >>>>>>> "storage_table" is set to "true" for a storage table and "false" for a >>>>>>> regular table. The absence of these properties indicates a regular view >>>>>>> or >>>>>>> table. >>>>>>> >>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog; >>>>>>> >>>>>>> // REST: GET /namespaces/db1/views/mv1 >>>>>>> // non-REST: load JSON file at metadata_location >>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1")); >>>>>>> >>>>>>> // REST: GET /namespaces/db1/tables/mv1 >>>>>>> // non-REST: load JSON file at table_metadata_location if present >>>>>>> Table storageTable = mv.storageTable(); >>>>>>> >>>>>>> // REST: POST /namespaces/db1/tables/mv1 >>>>>>> // non-REST: update JSON file at table_metadata_location >>>>>>> storageTable.newAppend().appendFile(...).commit(); >>>>>>> >>>>>>> We could then introduce a new requirement for views and tables >>>>>>> called "AssertProperty", which could make sure to only perform updates >>>>>>> that >>>>>>> are in line with materialized views. The additional requirement can be >>>>>>> seen >>>>>>> as a general extension that does not need to be changed if we decide to >>>>>>> go with Option 1 in the future. >>>>>>> >>>>>>> Let me know what you think. >>>>>>> >>>>>>> Best wishes, >>>>>>> >>>>>>> Jan >>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote: >>>>>>> >>>>>>> Thanks Ryan for the insights. 
I agree that reusing existing metadata >>>>>>> definitions and minimizing spec changes are very important. This also >>>>>>> minimizes spec drift (between materialized views and views spec, and >>>>>>> between materialized views and tables spec), and simplifies the >>>>>>> implementation. >>>>>>> >>>>>>> In an effort to take the discussion forward with concrete design >>>>>>> options based on an end-to-end implementation, I have prototyped the >>>>>>> implementation (and added Spark support) in this PR >>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us >>>>>>> reach convergence faster. More details about some of the design options >>>>>>> are >>>>>>> discussed in the description of the PR. >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>>> I mean separate table and view metadata that is somehow combined >>>>>>>> through a commit process. For instance, keeping a pointer to a table >>>>>>>> metadata file in a view metadata file or combining commits to reference >>>>>>>> both. I don't see the value in either option. >>>>>>>> >>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks Ryan for the help to trace back to the root question! Just >>>>>>>>> a clarification question regarding your reply before I reply further: >>>>>>>>> what >>>>>>>>> exactly does the option "a combination of the two (i.e. commits are >>>>>>>>> combined)" mean? How is that different from "a new metadata type"? >>>>>>>>> >>>>>>>>> -Jack >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>>>> >>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring a >>>>>>>>>> fresh perspective. >>>>>>>>>> >>>>>>>>>> Jack already pointed out that we need to start from the basics >>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right now >>>>>>>>>> is the >>>>>>>>>> time for discussing trade-offs, not lining up and taking sides. I >>>>>>>>>> realize >>>>>>>>>> that wasn’t the intent with adding a vote, but that’s almost always >>>>>>>>>> the >>>>>>>>>> result. It’s too easy to use it as a stand-in for consensus and move >>>>>>>>>> on >>>>>>>>>> prematurely. I get the impression from the swirl in Slack that >>>>>>>>>> discussion >>>>>>>>>> has moved ahead of agreement. >>>>>>>>>> >>>>>>>>>> We’re still at the most basic question: is a materialized view a >>>>>>>>>> view and a separate table, a combination of the two (i.e. commits are >>>>>>>>>> combined), or a new metadata type? >>>>>>>>>> >>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind >>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the >>>>>>>>>> catalog. >>>>>>>>>> That’s a later choice (already pointed out) and, I suspect, it >>>>>>>>>> should be >>>>>>>>>> delegated to catalog implementations. >>>>>>>>>> >>>>>>>>>> To simplify this a little, I think that we can eliminate the >>>>>>>>>> option to combine table and view commits. I don’t think there is a >>>>>>>>>> reason >>>>>>>>>> to combine the two. If separate, a table would track the view >>>>>>>>>> version used >>>>>>>>>> along with freshness information for referenced tables. If the table >>>>>>>>>> is >>>>>>>>>> automatically skipped when the version no longer matches the view, >>>>>>>>>> then no >>>>>>>>>> action needs to happen when a view definition changes. 
Similarly, >>>>>>>>>> the table >>>>>>>>>> can be updated independently without needing to also swap view >>>>>>>>>> metadata. >>>>>>>>>> This also aligns with the idea from the original doc that there can >>>>>>>>>> be >>>>>>>>>> multiple materialization tables for a view. Each should operate >>>>>>>>>> independently unless I’m missing something. >>>>>>>>>> >>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious, so >>>>>>>>>> I’ll move on, but please stop here and reply if you disagree! >>>>>>>>>> >>>>>>>>>> That leaves the two main options: a view and a separate table >>>>>>>>>> linked by metadata, or combined materialized view metadata. >>>>>>>>>> >>>>>>>>>> As the doc notes, the separate view and table option is simpler >>>>>>>>>> because it reuses existing metadata definitions and falls back to >>>>>>>>>> simple >>>>>>>>>> views. That is a significantly smaller spec, and small is very, very >>>>>>>>>> important when it comes to specs. I think that the argument for a new >>>>>>>>>> definition of a materialized view needs to overcome this >>>>>>>>>> disadvantage. >>>>>>>>>> >>>>>>>>>> The arguments that I see for a combined materialized view object >>>>>>>>>> are: >>>>>>>>>> >>>>>>>>>> - Regular views are separate, rather than being tables with >>>>>>>>>> SQL and no data, so it would be inconsistent (“Iceberg view is >>>>>>>>>> just a table >>>>>>>>>> with no data but with representations defined. But we did not do >>>>>>>>>> that.”) >>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>>> materialized views >>>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>>> isn’t required by the separate view and table option >>>>>>>>>> >>>>>>>>>> Am I missing any arguments for combined metadata? >>>>>>>>>> >>>>>>>>>> Ryan >>>>>>>>>> -- >>>>>>>>>> Ryan Blue >>>>>>>>>> Tabular >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ryan Blue >>>>>>>> Tabular >>>>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Tabular >>>>> >>>>> -- Ryan Blue Tabular