Sounds good, let's discuss this in person! I am a bit worried that we have quite a few critical topics going on right now on the dev list, and this will take up a lot of time to discuss. If it ends up going on for too long, I propose we have a dedicated meeting, and I am more than happy to organize it.
Best, Jack Ye On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote: > Hey everyone, > > I think this thread has hit a point of diminishing returns and that we > still don't have a common understanding of what the options under > consideration actually are. > > Since we were already planning on discussing this at the next community > sync, I suggest we pick this up there and use that time to align on what > exactly we're considering. We can then start a new thread to lay out the > designs under consideration in more detail and then have a discussion about > trade-offs. > > Does that sound reasonable? > > Ryan > > > On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >> I am finding it hard to interpret the options concretely. I would also >> suggest breaking the expectation/outcome to milestones. Maybe it becomes >> easier if we agree to distinguish between an approach that is feasible in >> the near term and another in the long term, especially if the latter >> requires significant engine-side changes. >> >> Further, maybe it helps if we start with an option that fully reuses the >> existing spec, and see how we view it in comparison with the options >> discussed previously. I am sharing one below. It reuses the current spec of >> Iceberg views and tables by leveraging table properties to capture >> materialized view metadata. What is common (and not common) between this >> and the desired representations? >> >> The new properties are: >> Properties on a View: >> >> 1. >> >> *iceberg.materialized.view*: >> - *Type*: View property >> - *Purpose*: This property is used to mark whether a view is a >> materialized view. If set to true, the view is treated as a >> materialized view. This helps in differentiating between virtual and >> materialized views within the catalog and dictates specific handling >> and >> validation logic for materialized views. >> 2. 
>> >> *iceberg.materialized.view.storage.location*: >> - *Type*: View property >> - *Purpose*: Specifies the location of the storage table >> associated with the materialized view. This property is used for >> linking a >> materialized view with its corresponding storage table, enabling data >> management and query execution based on the stored data freshness. >> >> Properties on a Table: >> >> 1. *base.snapshot.[UUID]*: >> - *Type*: Table property >> - *Purpose*: These properties store the snapshot IDs of the base >> tables at the time the materialized view's data was last updated. Each >> property is prefixed with base.snapshot. followed by the UUID of >> the base table. They are used to track whether the materialized view's >> data >> is up to date with the base tables by comparing these snapshot IDs >> with the >> current snapshot IDs of the base tables. If all the base tables' >> current >> snapshot IDs match the ones stored in these properties, the >> materialized >> view's data is considered fresh. >> >> >> Thanks, >> Walaa. >> >> >> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote: >> >>> > All of these approaches are aligned in one, specific way: the storage >>> table is an iceberg table. >>> >>> I do not think that is true. I think people are aligned that we would >>> like to re-use the Iceberg table metadata defined in the Iceberg table spec >>> to express the data in MV, but I don't think it goes that far to say it >>> must be an Iceberg table. Once you have that mindset, then of course option >>> 1 (separate table and view) is the only option. >>> >>> > I don't think that is necessary and it significantly increases the >>> complexity. >>> >>> And can you quantify what you mean by "significantly increases the >>> complexity"? Seems like a lot of concerns are coming from the tradeoff with >>> complexity. 
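For concreteness, the freshness check Walaa describes above — comparing each base.snapshot.[UUID] property recorded on the storage table against the base table's current snapshot ID — could be sketched roughly as below. This is only an illustration: plain maps stand in for the catalog and table objects, and none of these names are Iceberg API.

```java
import java.util.Map;

// Illustrative sketch of the freshness check implied by Walaa's proposal:
// the storage table records "base.snapshot.<uuid>" properties at refresh time,
// and the MV data is fresh only if every recorded snapshot id still matches
// the base table's current snapshot id. Not Iceberg API.
public class FreshnessCheck {
    private static final String PREFIX = "base.snapshot.";

    public static boolean isFresh(Map<String, String> storageTableProperties,
                                  Map<String, Long> currentBaseSnapshots) {
        for (Map.Entry<String, String> entry : storageTableProperties.entrySet()) {
            if (!entry.getKey().startsWith(PREFIX)) {
                continue; // unrelated table property
            }
            String baseTableUuid = entry.getKey().substring(PREFIX.length());
            Long current = currentBaseSnapshots.get(baseTableUuid);
            // Missing base table or an advanced snapshot means stale data.
            if (current == null || current != Long.parseLong(entry.getValue())) {
                return false;
            }
        }
        return true;
    }
}
```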
We probably all agree that using option 7 (a completely new >>> metadata type) is a lot of work from scratch, that is why it is not >>> favored. However, my understanding is that as long as we re-use the view >>> and table metadata, then the majority of the existing logic can be reused. >>> I think what we have gone through in Slack to draft the rough Java API >>> shape helps here, because people can estimate the amount of effort required >>> to implement it. And I don't think they are **significantly** more complex >>> to implement. Could you elaborate more about the complexity that you >>> imagine? >>> >>> -Jack >>> >>> >>> >>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com> >>> wrote: >>> >>>> I feel I've been most vocal about pushing back against options 2+ (or >>>> Ryan's categories of combined table/view, or new metadata type), so I'll >>>> try to expand on my reasoning. >>>> >>>> I understand the appeal of creating a design where we encapsulate the >>>> view/storage from both a structural and performance standpoint, but I don't >>>> think that is necessary and it significantly increases the complexity. >>>> >>>> All of these approaches are aligned in one, specific way: the storage >>>> table is an iceberg table. >>>> >>>> Because of this, all the behaviors and requirements still apply to >>>> these tables. They need to be maintained (snapshot cleanup, orphan files), >>>> in cases need to be optimized (compaction, manifest rewrites), they need to >>>> be able to be inspected (this will be even more important with MV since >>>> staleness can produce different results and questions will arise about what >>>> state the storage table was in). There may be cases where the tables need >>>> to be managed directly. >>>> >>>> Anywhere we deviate from the existing constructs/commit/access for >>>> tables, we will ultimately have to then unwrap to re-expose the underlying >>>> Iceberg behavior. 
This creates unnecessary complexity in the library/API >>>> layer, which are not the primary interface users will have with >>>> materialized views where an engine is almost entirely necessary to interact >>>> with the dataset. >>>> >>>> As to the performance concerns around option 1, I think we're >>>> overstating the downsides. It really comes down to how many metadata loads >>>> are necessary and evaluating freshness would likely be the real bottleneck >>>> as it involves potentially loading many tables. All of the options are on >>>> the same order of performance for the metadata and table loads. >>>> >>>> As to the visibility of tables and whether they're registered in the >>>> catalog, I think registering in the catalog is the right approach so that >>>> the tables are still addressable for maintenance/etc. The visibility of >>>> the storage table is a catalog implementation decision and shouldn't be a >>>> requirement of the MV spec (I can see cases for both and it isn't necessary >>>> to dictate a behavior). >>>> >>>> I'm still strongly in favor of Option 1 (separate table and view) for >>>> these reasons. >>>> >>>> -Dan >>>> >>>> >>>> >>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote: >>>> >>>>> > Jack, it sounds like you’re the proponent of a combined table and >>>>> view (rather than a new metadata spec for a materialized view). What is >>>>> the >>>>> main motivation? It seems like you’re convinced of that approach, but I >>>>> don’t understand the advantage it brings. >>>>> >>>>> Sorry I have to make a Google Sheet to capture all the options we have >>>>> discussed so far, I wanted to use the existing Google Doc, but it has >>>>> really bad table/sheet support... >>>>> >>>>> >>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>>> >>>>> I have listed all the options, with how they are implemented and some >>>>> important considerations we have discussed so far. 
Note that: >>>>> 1. This sheet currently excludes the lineage information, which we can >>>>> discuss more later after the current topic is resolved. >>>>> 2. I removed the considerations for REST integration since from the >>>>> other thread we have clarified that they should be considered completely >>>>> separately. >>>>> >>>>> *Why I came to be a proponent of having a new MV object with table and >>>>> view metadata file pointer* >>>>> >>>>> In my sheet, there are 3 options that do not have major problems: >>>>> Option 2: Add storage table metadata file pointer in view object >>>>> Option 5: New MV object with table and view metadata file pointer >>>>> Option 6: New MV spec with table and view metadata >>>>> >>>>> I originally excluded option 2 because I think it does not align with >>>>> the REST spec, but after the other discussion thread about "Inconsistency >>>>> between REST spec and table/view spec", I think my original concern no >>>>> longer holds true so now I put it back. And based on my personal >>>>> preference that MV is an independent object that should be separated from >>>>> view and table, plus the fact that option 5 is probably less work than >>>>> option 6 for implementation, that is how I came to be a proponent of option 5 >>>>> at this moment. >>>>> >>>>> >>>>> *Regarding Ryan's evaluation framework* >>>>> >>>>> I think we need to reconcile this sheet with Ryan's evaluation >>>>> framework. That framework categorization puts options 2, 3, 4, 5, and 6 all >>>>> under the same category of "A combination of a view and a table" and >>>>> concludes that they don't have any advantage for the same set of reasons. >>>>> But those reasons are not really convincing to me so let's talk about them >>>>> in more detail. >>>>> >>>>> (1) You said "I don’t see a reason why a combined view and table is >>>>> advantageous" as "this would cause unnecessary dependence between the view >>>>> and table in catalogs." What dependency exactly do you mean here?
And why >>>>> is that unnecessary, given there has to be some sort of dependency anyway >>>>> unless we go with option 5 or 6? >>>>> >>>>> (2) You said "I guess there’s an argument that you could load both >>>>> table and view metadata locations at the same time. That hardly seems >>>>> worth >>>>> the trouble". I disagree with that. Catalog interaction performance is >>>>> critical to at least everyone working in EMR and Athena, and MV itself as >>>>> an acceleration approach needs to be as fast as possible. >>>>> >>>>> I have put 3 key operations in the doc that I think matters for MV >>>>> during interactions with engine: >>>>> 1. refreshes storage table >>>>> 2. get the storage table of the MV >>>>> 3. if stale, get the view SQL >>>>> >>>>> And option 1 clearly falls short with 4 sequential steps required to >>>>> load a storage table. You mentioned "recent issues with adding views to >>>>> the >>>>> JDBC catalog" in this topic, could you explain a bit more? >>>>> >>>>> (3) You said "I also think that once we decide on structure, we can >>>>> make it possible for REST catalog implementations to do smart things, in a >>>>> way that doesn’t put additional requirements on the underlying catalog >>>>> store." If REST is fully compatible with Iceberg spec then I have no >>>>> problem with this statement. However, as we discussed in the other thread, >>>>> it is not the case. In the current state, I think the sequence of action >>>>> should be to evolve the Iceberg table/view spec (or add a MV spec) first, >>>>> and then think about how REST can incorporate it or do smart things that >>>>> are not Iceberg spec compliant. Do you agree with that? >>>>> >>>>> (4) You said the table identifier pointer "is a problem we need to >>>>> solve generally because a materialized table needs to be able to track the >>>>> upstream state of tables that were used". I don't think that is a reason >>>>> to >>>>> choose to use a table identifier pointer for a storage table. 
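To make the "4 sequential steps" claim above concrete: under option 1 as described in the thread, each step needs the previous step's result, so none of them can be overlapped. The breakdown below is one interpretation of the discussion, not anything from the spec.

```java
// Illustrative breakdown, not spec: the dependent lookups an engine performs
// under option 1 to get from an MV name to its storage table's metadata.
public class Option1LoadPath {
    public static String[] steps(String viewName) {
        return new String[] {
            "catalog: resolve '" + viewName + "' to a view metadata location",
            "storage: read the view metadata file, extract the storage table identifier",
            "catalog: resolve the storage table identifier to a table metadata location",
            "storage: read the table metadata file",
        };
    }
}
```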
The issue is >>>>> not about using a table identifier pointer. It is about exposing the >>>>> storage table as a separate entity in the catalog, which is what people do >>>>> not like and is already discussed in length in Jan's question 3 (also >>>>> linked in the sheet). I agree with that statement, because without a REST >>>>> implementation that can magically hide the storage table, this model adds >>>>> additional burden regarding compliance and data governance for any other >>>>> non-REST catalog implementations that are compliant to the Iceberg spec. >>>>> Many mechanisms need to be built in a catalog to hide, protect, maintain, >>>>> recycle the storage table, that can be avoided by using other approaches. >>>>> I >>>>> think we should reach a consensus about that and discuss further if you do >>>>> not agree. >>>>> >>>>> Best, >>>>> Jack Ye >>>>> >>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <jank...@mailbox.org.invalid> >>>>> wrote: >>>>> >>>>>> Hi Ryan, we actually discussed your categories in this question >>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>. >>>>>> Where your categories correspond to the following designs: >>>>>> >>>>>> - Separate table and view => Design 1 >>>>>> - Combination of view and table => Design 2 >>>>>> - A new metadata type => Design 4 >>>>>> >>>>>> Jan >>>>>> On 01.03.24 00:03, Ryan Blue wrote: >>>>>> >>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll >>>>>> be more specific: >>>>>> >>>>>> - *Separate table and view*: this option is to have the objects >>>>>> that we have today, with extra metadata. Commit processes are >>>>>> separate: >>>>>> committing to the table doesn’t alter the view and committing to the >>>>>> view >>>>>> doesn’t change the table. However, changing the view can make it so >>>>>> the >>>>>> table is no longer useful as a materialization. 
>>>>>> - *A combination of a view and a table*: in this option, the >>>>>> table metadata and view metadata are the same as the first option. The >>>>>> difference is that the commit process combines them, either by >>>>>> embedding a >>>>>> table metadata location in view metadata or by tracking both in the >>>>>> same >>>>>> catalog reference. >>>>>> - *A new metadata type*: this option is where we define a new >>>>>> metadata object that has view attributes, like SQL representations, >>>>>> along >>>>>> with table attributes, like partition specs and snapshots. >>>>>> >>>>>> Hopefully this is clear because I think much of the confusion is >>>>>> caused by different definitions. >>>>>> >>>>>> The LoadTableResponse having optional metadata-location field implies >>>>>> that the object in the catalog no longer needs to hold a metadata file >>>>>> pointer >>>>>> >>>>>> The REST protocol has not removed the requirement for a metadata >>>>>> file, so I’m going to keep focused on the MV design options. >>>>>> >>>>>> When we say a MV can be a “new metadata type”, it does not mean it >>>>>> needs to define a completely brand new structure of the metadata content >>>>>> >>>>>> I’m making a distinction between separate metadata files for the >>>>>> table and the view and a combined metadata object, as above. >>>>>> >>>>>> We can define an “Iceberg MV” to be an object in a catalog, which has >>>>>> 1 table metadata file pointer, and 1 view metadata file pointer >>>>>> >>>>>> This is the option I am referring to as a “combination of a view and >>>>>> a table”. >>>>>> >>>>>> So to review my initial email, I don’t see a reason why a combined >>>>>> view and table is advantageous, either implemented by having a catalog >>>>>> reference with two metadata locations or embedding a table metadata >>>>>> location in view metadata. This would cause unnecessary dependence >>>>>> between >>>>>> the view and table in catalogs. 
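One way to keep the three categories straight is to write them down as data shapes. This is only a sketch with made-up field names, not a spec proposal; the real distinction between the first two shapes is the commit process, which the comments call out.

```java
// Illustrative data shapes for the three categories; all names are made up.
public class MvShapes {
    // 1. Separate table and view: two independent catalog entries with
    //    independent commit processes; any link lives inside the metadata.
    public static class SeparateEntities {
        public String viewMetadataLocation;          // catalog entry for the view
        public String storageTableMetadataLocation;  // catalog entry for the table
    }

    // 2. Combination of a view and a table: the same two metadata files, but
    //    one catalog reference tracks both, so one commit can swap both pointers.
    public static class CombinedReference {
        public String viewMetadataLocation;
        public String tableMetadataLocation;
    }

    // 3. New metadata type: one document carrying view attributes (SQL
    //    representations) and table attributes (specs, snapshots) together.
    public static class MaterializedViewMetadata {
        public String sqlRepresentation;
        public String partitionSpec;
        public long currentSnapshotId;
    }
}
```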
I guess there’s an argument that you >>>>>> could >>>>>> load both table and view metadata locations at the same time. That hardly >>>>>> seems worth the trouble given the recent issues with adding views to the >>>>>> JDBC catalog. >>>>>> >>>>>> I also think that once we decide on structure, we can make it >>>>>> possible for REST catalog implementations to do smart things, in a way >>>>>> that >>>>>> doesn’t put additional requirements on the underlying catalog store. For >>>>>> instance, we could specify how to send additional objects in a >>>>>> LoadViewResult, in case the catalog wants to pre-fetch table metadata. I >>>>>> think these optimizations are a later addition, after we define the >>>>>> relationship between views and tables. >>>>>> >>>>>> Jack, it sounds like you’re the proponent of a combined table and >>>>>> view (rather than a new metadata spec for a materialized view). What is >>>>>> the >>>>>> main motivation? It seems like you’re convinced of that approach, but I >>>>>> don’t understand the advantage it brings. >>>>>> >>>>>> Ryan >>>>>> >>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi >>>>>>> >>>>>>> Yes I mostly agree with the assessment. To clarify a few minor >>>>>>> points. >>>>>>> >>>>>>> is a materialized view a view and a separate table, a combination of >>>>>>>> the two (i.e. commits are combined), or a new metadata type? >>>>>>> >>>>>>> >>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal >>>>>>> of a new Catalog MV object that has two references (ViewMetadata + >>>>>>> TableMetadata). >>>>>>> >>>>>>> The arguments that I see for a combined materialized view object >>>>>>>> are: >>>>>>>> >>>>>>>> - Regular views are separate, rather than being tables with SQL >>>>>>>> and no data so it would be inconsistent (“Iceberg view is just a >>>>>>>> table with >>>>>>>> no data but with representations defined. 
But we did not do that.”) >>>>>>>> >>>>>>>> >>>>>>>> - Materialized views are different objects in DDL >>>>>>>> >>>>>>>> >>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>> materialized views >>>>>>>> >>>>>>>> >>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>> isn’t required by the separate view and table option >>>>>>>> >>>>>>>> For completeness, there seem to be a few additional ones (mentioned >>>>>>> in the Slack and above messages). >>>>>>> >>>>>>> - Lack of spec change (to ViewMetadata). But as Jack says it is >>>>>>> a spec change (ie, to catalogs) >>>>>>> - A single call to get the View's StorageTable (versus two calls) >>>>>>> - A more natural API, no opportunity for user to call >>>>>>> Catalog.dropTable() and renameTable() on storage table >>>>>>> >>>>>>> >>>>>>> *Thoughts: *I think the long discussion sessions we had on Slack >>>>>>> was fruitful for me, as seeing the API clarified some things. >>>>>>> >>>>>>> I was initially more in favor of MV being a new metadata type >>>>>>> (TableMetadata + ViewMetadata). But seeing most of the MV operations >>>>>>> end >>>>>>> up being ViewCatalog or Catalog operations, I am starting to think >>>>>>> API-wise >>>>>>> that it may not align with the new metadata type (unless we define >>>>>>> MVCatalog and /MV REST endpoints, which then are boilerplate wrappers). >>>>>>> >>>>>>> Initially one question I had for option 'a view and a separate >>>>>>> table', was how to make this table reference (metadata.json or catalog >>>>>>> reference). In the previous option, we had a precedent of Catalog >>>>>>> references to Metadata, but not pointers between Metadatas. I initially >>>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting' >>>>>>> catalog >>>>>>> concerns in ViewMetadata. (I saw Catalog and ViewCatalog as a layer >>>>>>> above >>>>>>> TableMetadata and ViewMetadata). 
But I think Dan in the Slack made a >>>>>>> fair >>>>>>> point that ViewMetadata already is tightly bound with a Catalog. In >>>>>>> this >>>>>>> case, I think this approach does have its merits as well in aligning >>>>>>> Catalog API's with the metadata. >>>>>>> >>>>>>> Thanks >>>>>>> Szehon >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I would like to provide my perspective on the question of what a >>>>>>>> materialized view is and elaborate on Jack's recent proposal to view a >>>>>>>> materialized view as a catalog concept. >>>>>>>> >>>>>>>> Firstly, let's look at the role of the catalog. Every entity in the >>>>>>>> catalog has a *unique identifier*, and the catalog provides >>>>>>>> methods to create, load, and update these entities. An important thing >>>>>>>> to >>>>>>>> note is that the catalog methods exhibit two different behaviors: the >>>>>>>> *create >>>>>>>> and load methods deal with the entire entity*, while the >>>>>>>> *update(commit) >>>>>>>> method only deals with partial changes* to the entities. >>>>>>>> >>>>>>>> In the context of our current discussion, materialized view (MV) >>>>>>>> metadata is a union of view and table metadata. The fact that the >>>>>>>> update >>>>>>>> method deals only with partial changes, enables us to *reuse the >>>>>>>> existing methods for updating tables and views*. For updates we >>>>>>>> don't have to define what constitutes an entire materialized view. >>>>>>>> Changes >>>>>>>> to a materialized view targeting the properties related to the view >>>>>>>> metadata could use the update(commit) view method. Similarly, changes >>>>>>>> targeting the properties related to the table metadata could use the >>>>>>>> update(commit) table method. This is great news because we don't have >>>>>>>> to >>>>>>>> redefine view and table commits (requirements, updates). 
>>>>>>>> This is shown in the fact that Jack uses the same operation to >>>>>>>> update the storage table for Option 1 and 3: >>>>>>>> >>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true >>>>>>>> // non-REST: update JSON files at table_metadata_location >>>>>>>> storageTable.newAppend().appendFile(...).commit(); >>>>>>>> >>>>>>>> The open question is *whether the create and load methods should >>>>>>>> treat the properties that constitute the MV metadata as two entities >>>>>>>> (View >>>>>>>> + Table) or one entity (new MV object)*. This is all part of >>>>>>>> Jack's proposal, where Option 1 proposes a new MV object, and Option 3 >>>>>>>> proposes two separate entities. The advantage of Option 1 is that it >>>>>>>> doesn't require two operations to load the metadata. On the other >>>>>>>> hand, the >>>>>>>> advantage of Option 3 is that no new operations or catalogs have to be >>>>>>>> defined. >>>>>>>> >>>>>>>> In my opinion, defining a new representation for materialized views >>>>>>>> (Option 1) is generally the cleaner solution. However, I see a path >>>>>>>> where >>>>>>>> we could first introduce Option 3 and still have the possibility to >>>>>>>> transition to Option 1 if needed. The great thing about Option 3 is >>>>>>>> that it >>>>>>>> only requires minor changes to the current spec and is mostly >>>>>>>> implementation detail. >>>>>>>> >>>>>>>> Therefore I would propose small additions to Jacks Option 3 that >>>>>>>> only introduce changes to the spec that are not specific to >>>>>>>> materialized >>>>>>>> views. The idea is to introduce boolean properties to be set on the >>>>>>>> creation of the view and the storage table that indicate that they >>>>>>>> belong >>>>>>>> to a materialized view. The view property "materialized" is set to >>>>>>>> "true" >>>>>>>> for a MV and "false" for a regular view. And the table property >>>>>>>> "storage_table" is set to "true" for a storage table and "false" for a >>>>>>>> regular table. 
The absence of these properties indicates a regular >>>>>>>> view or >>>>>>>> table. >>>>>>>> >>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog; >>>>>>>> >>>>>>>> // REST: GET /namespaces/db1/views/mv1 >>>>>>>> // non-REST: load JSON file at metadata_location >>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1")); >>>>>>>> >>>>>>>> // REST: GET /namespaces/db1/tables/mv1 >>>>>>>> // non-REST: load JSON file at table_metadata_location if present >>>>>>>> Table storageTable = mv.storageTable(); >>>>>>>> >>>>>>>> // REST: POST /namespaces/db1/tables/mv1 >>>>>>>> // non-REST: update JSON file at table_metadata_location >>>>>>>> storageTable.newAppend().appendFile(...).commit(); >>>>>>>> >>>>>>>> We could then introduce a new requirement for views and tables >>>>>>>> called "AssertProperty" which could make sure to only perform updates >>>>>>>> that >>>>>>>> are in line with materialized views. The additional requirement can be >>>>>>>> seen >>>>>>>> as a general extension which does not need to be changed if we decide >>>>>>>> to >>>>>>>> go with Option 1 in the future. >>>>>>>> >>>>>>>> Let me know what you think. >>>>>>>> >>>>>>>> Best wishes, >>>>>>>> >>>>>>>> Jan >>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote: >>>>>>>> >>>>>>>> Thanks Ryan for the insights. I agree that reusing existing >>>>>>>> metadata definitions and minimizing spec changes are very important. >>>>>>>> This >>>>>>>> also minimizes spec drift (between materialized views and views spec, >>>>>>>> and >>>>>>>> between materialized views and tables spec), and simplifies the >>>>>>>> implementation. >>>>>>>> >>>>>>>> In an effort to take the discussion forward with concrete design >>>>>>>> options based on an end-to-end implementation, I have prototyped the >>>>>>>> implementation (and added Spark support) in this PR >>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us >>>>>>>> reach convergence faster.
More details about some of the design >>>>>>>> options are >>>>>>>> discussed in the description of the PR. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Walaa. >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>>> >>>>>>>>> I mean separate table and view metadata that is somehow combined >>>>>>>>> through a commit process. For instance, keeping a pointer to a table >>>>>>>>> metadata file in a view metadata file or combining commits to >>>>>>>>> reference >>>>>>>>> both. I don't see the value in either option. >>>>>>>>> >>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks Ryan for the help to trace back to the root question! Just >>>>>>>>>> a clarification question regarding your reply before I reply >>>>>>>>>> further: what >>>>>>>>>> exactly does the option "a combination of the two (i.e. commits are >>>>>>>>>> combined)" mean? How is that different from "a new metadata type"? >>>>>>>>>> >>>>>>>>>> -Jack >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring a >>>>>>>>>>> fresh perspective. >>>>>>>>>>> >>>>>>>>>>> Jack already pointed out that we need to start from the basics >>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right now >>>>>>>>>>> is the >>>>>>>>>>> time for discussing trade-offs, not lining up and taking sides. I >>>>>>>>>>> realize >>>>>>>>>>> that wasn’t the intent with adding a vote, but that’s almost always >>>>>>>>>>> the >>>>>>>>>>> result. It’s too easy to use it as a stand-in for consensus and >>>>>>>>>>> move on >>>>>>>>>>> prematurely. I get the impression from the swirl in Slack that >>>>>>>>>>> discussion >>>>>>>>>>> has moved ahead of agreement. 
>>>>>>>>>>> >>>>>>>>>>> We’re still at the most basic question: is a materialized view a >>>>>>>>>>> view and a separate table, a combination of the two (i.e. commits >>>>>>>>>>> are >>>>>>>>>>> combined), or a new metadata type? >>>>>>>>>>> >>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind >>>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the >>>>>>>>>>> catalog. >>>>>>>>>>> That’s a later choice (already pointed out) and, I suspect, it >>>>>>>>>>> should be >>>>>>>>>>> delegated to catalog implementations. >>>>>>>>>>> >>>>>>>>>>> To simplify this a little, I think that we can eliminate the >>>>>>>>>>> option to combine table and view commits. I don’t think there is a >>>>>>>>>>> reason >>>>>>>>>>> to combine the two. If separate, a table would track the view >>>>>>>>>>> version used >>>>>>>>>>> along with freshness information for referenced tables. If the >>>>>>>>>>> table is >>>>>>>>>>> automatically skipped when the version no longer matches the view, >>>>>>>>>>> then no >>>>>>>>>>> action needs to happen when a view definition changes. Similarly, >>>>>>>>>>> the table >>>>>>>>>>> can be updated independently without needing to also swap view >>>>>>>>>>> metadata. >>>>>>>>>>> This also aligns with the idea from the original doc that there can >>>>>>>>>>> be >>>>>>>>>>> multiple materialization tables for a view. Each should operate >>>>>>>>>>> independently unless I’m missing something >>>>>>>>>>> >>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious so >>>>>>>>>>> I’ll move on, but please stop here and reply if you disagree! >>>>>>>>>>> >>>>>>>>>>> That leaves the main two options, a view and a separate table >>>>>>>>>>> linked by metadata, or, combined materialized view metadata. >>>>>>>>>>> >>>>>>>>>>> As the doc notes, the separate view and table option is simpler >>>>>>>>>>> because it reuses existing metadata definitions and falls back to >>>>>>>>>>> simple >>>>>>>>>>> views. 
That is a significantly smaller spec and small is very, very >>>>>>>>>>> important when it comes to specs. I think that the argument for a >>>>>>>>>>> new >>>>>>>>>>> definition of a materialized view needs to overcome this >>>>>>>>>>> disadvantage. >>>>>>>>>>> >>>>>>>>>>> The arguments that I see for a combined materialized view object >>>>>>>>>>> are: >>>>>>>>>>> >>>>>>>>>>> - Regular views are separate, rather than being tables with >>>>>>>>>>> SQL and no data so it would be inconsistent (“Iceberg view is >>>>>>>>>>> just a table >>>>>>>>>>> with no data but with representations defined. But we did not do >>>>>>>>>>> that.”) >>>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>>>> materialized views >>>>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>>>> isn’t required by the separate view and table option >>>>>>>>>>> >>>>>>>>>>> Am I missing any arguments for combined metadata? >>>>>>>>>>> >>>>>>>>>>> Ryan >>>>>>>>>>> -- >>>>>>>>>>> Ryan Blue >>>>>>>>>>> Tabular >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ryan Blue >>>>>>>>> Tabular >>>>>>>>> >>>>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>>> > > -- > Ryan Blue > Tabular >
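The independent-table model Ryan sketches above — the storage table records the view version it materialized plus freshness information for the referenced tables, and engines simply skip the table when anything no longer matches — can be written down in a few lines. All names below are illustrative, not spec.

```java
import java.util.Map;

// Illustrative names, not spec: the storage table records which view version
// it materialized and which base-table snapshots it read; an engine checks
// both at query time and falls back to the view SQL when anything moved, so
// neither commit path ever has to touch the other object.
public class MaterializationCheck {
    public static boolean usable(int materializedViewVersion,
                                 int currentViewVersion,
                                 Map<String, Long> recordedBaseSnapshots,
                                 Map<String, Long> currentBaseSnapshots) {
        // View definition changed since the last refresh: skip the table;
        // no action on the table was needed when the view changed.
        if (materializedViewVersion != currentViewVersion) {
            return false;
        }
        // A referenced table has a new current snapshot: data is stale.
        for (Map.Entry<String, Long> entry : recordedBaseSnapshots.entrySet()) {
            if (!entry.getValue().equals(currentBaseSnapshots.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```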