The calendar on the site is currently broken: https://iceberg.apache.org/community/#iceberg-community-events. It might help to fix it or share the meeting link here.
On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote: > Sounds good, let's discuss this in person! > > I am a bit worried that we have quite a few critical topics going on right > now on devlist, and this will take up a lot of time to discuss. If it ends > up going for too long, I propose we have a dedicated meeting, and I am > more than happy to organize it. > > Best, > Jack Ye > > On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote: > >> Hey everyone, >> >> I think this thread has hit a point of diminishing returns and that we >> still don't have a common understanding of what the options under >> consideration actually are. >> >> Since we were already planning on discussing this at the next community >> sync, I suggest we pick this up there and use that time to align on what >> exactly we're considering. We can then start a new thread to lay out the >> designs under consideration in more detail and then have a discussion about >> trade-offs. >> >> Does that sound reasonable? >> >> Ryan >> >> >> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> I am finding it hard to interpret the options concretely. I would also >>> suggest breaking the expectation/outcome into milestones. Maybe it becomes >>> easier if we agree to distinguish between an approach that is feasible in >>> the near term and another in the long term, especially if the latter >>> requires significant engine-side changes. >>> >>> Further, maybe it helps if we start with an option that fully reuses the >>> existing spec, and see how we view it in comparison with the options >>> discussed previously. I am sharing one below. It reuses the current spec of >>> Iceberg views and tables by leveraging table properties to capture >>> materialized view metadata. What is common (and not common) between this >>> and the desired representations? >>> >>> The new properties are: >>> Properties on a View: >>> >>> 1. 
>>> >>> *iceberg.materialized.view*: >>> - *Type*: View property >>> - *Purpose*: This property is used to mark whether a view is a >>> materialized view. If set to true, the view is treated as a >>> materialized view. This helps in differentiating between virtual and >>> materialized views within the catalog and dictates specific handling >>> and >>> validation logic for materialized views. >>> 2. >>> >>> *iceberg.materialized.view.storage.location*: >>> - *Type*: View property >>> - *Purpose*: Specifies the location of the storage table >>> associated with the materialized view. This property is used for >>> linking a >>> materialized view with its corresponding storage table, enabling data >>> management and query execution based on the stored data freshness. >>> >>> Properties on a Table: >>> >>> 1. *base.snapshot.[UUID]*: >>> - *Type*: Table property >>> - *Purpose*: These properties store the snapshot IDs of the base >>> tables at the time the materialized view's data was last updated. Each >>> property is prefixed with base.snapshot. followed by the UUID of >>> the base table. They are used to track whether the materialized >>> view's data >>> is up to date with the base tables by comparing these snapshot IDs >>> with the >>> current snapshot IDs of the base tables. If all the base tables' >>> current >>> snapshot IDs match the ones stored in these properties, the >>> materialized >>> view's data is considered fresh. >>> >>> >>> Thanks, >>> Walaa. >>> >>> >>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote: >>> >>>> > All of these approaches are aligned in one, specific way: the storage >>>> table is an iceberg table. >>>> >>>> I do not think that is true. I think people are aligned that we would >>>> like to re-use the Iceberg table metadata defined in the Iceberg table spec >>>> to express the data in MV, but I don't think it goes that far to say it >>>> must be an Iceberg table. 
Once you have that mindset, then of course option >>>> 1 (separate table and view) is the only option. >>>> >>>> > I don't think that is necessary and it significantly increases the >>>> complexity. >>>> >>>> And can you quantify what you mean by "significantly increases the >>>> complexity"? Seems like a lot of concerns are coming from the tradeoff with >>>> complexity. We probably all agree that using option 7 (a completely new >>>> metadata type) is a lot of work from scratch, that is why it is not >>>> favored. However, my understanding is that as long as we re-use the view >>>> and table metadata, then the majority of the existing logic can be reused. >>>> I think what we have gone through in Slack to draft the rough Java API >>>> shape helps here, because people can estimate the amount of effort required >>>> to implement it. And I don't think they are **significantly** more complex >>>> to implement. Could you elaborate more about the complexity that you >>>> imagine? >>>> >>>> -Jack >>>> >>>> >>>> >>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com> >>>> wrote: >>>> >>>>> I feel I've been most vocal about pushing back against options 2+ (or >>>>> Ryan's categories of combined table/view, or new metadata type), so I'll >>>>> try to expand on my reasoning. >>>>> >>>>> I understand the appeal of creating a design where we encapsulate the >>>>> view/storage from both a structural and performance standpoint, but I >>>>> don't >>>>> think that is necessary and it significantly increases the complexity. >>>>> >>>>> All of these approaches are aligned in one, specific way: the storage >>>>> table is an iceberg table. >>>>> >>>>> Because of this, all the behaviors and requirements still apply to >>>>> these tables. 
They need to be maintained (snapshot cleanup, orphan >>>>> files), >>>>> in some cases need to be optimized (compaction, manifest rewrites), they need >>>>> to >>>>> be able to be inspected (this will be even more important with MV since >>>>> staleness can produce different results and questions will arise about >>>>> what >>>>> state the storage table was in). There may be cases where the tables need >>>>> to be managed directly. >>>>> >>>>> Anywhere we deviate from the existing constructs/commit/access for >>>>> tables, we will ultimately have to then unwrap to re-expose the underlying >>>>> Iceberg behavior. This creates unnecessary complexity in the library/API >>>>> layer, which is not the primary interface users will have with >>>>> materialized views where an engine is almost entirely necessary to >>>>> interact >>>>> with the dataset. >>>>> >>>>> As to the performance concerns around option 1, I think we're >>>>> overstating the downsides. It really comes down to how many metadata >>>>> loads >>>>> are necessary, and evaluating freshness would likely be the real bottleneck >>>>> as it involves potentially loading many tables. All of the options are on >>>>> the same order of performance for the metadata and table loads. >>>>> >>>>> As to the visibility of tables and whether they're registered in the >>>>> catalog, I think registering in the catalog is the right approach so that >>>>> the tables are still addressable for maintenance/etc. The visibility of >>>>> the storage table is a catalog implementation decision and shouldn't be a >>>>> requirement of the MV spec (I can see cases for both and it isn't >>>>> necessary >>>>> to dictate a behavior). >>>>> >>>>> I'm still strongly in favor of Option 1 (separate table and view) for >>>>> these reasons. 
>>>>> >>>>> -Dan >>>>> >>>>> >>>>> >>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>> >>>>>> > Jack, it sounds like you’re the proponent of a combined table and >>>>>> view (rather than a new metadata spec for a materialized view). What is >>>>>> the >>>>>> main motivation? It seems like you’re convinced of that approach, but I >>>>>> don’t understand the advantage it brings. >>>>>> >>>>>> Sorry I have to make a Google Sheet to capture all the options we >>>>>> have discussed so far, I wanted to use the existing Google Doc, but it >>>>>> has >>>>>> really bad table/sheet support... >>>>>> >>>>>> >>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>>>> >>>>>> I have listed all the options, with how they are implemented and some >>>>>> important considerations we have discussed so far. Note that: >>>>>> 1. This sheet currently excludes the lineage information, which we >>>>>> can discuss more later after the current topic is resolved. >>>>>> 2. I removed the considerations for REST integration since from the >>>>>> other thread we have clarified that they should be considered completely >>>>>> separately. >>>>>> >>>>>> *Why I come as a proponent of having a new MV object with table and >>>>>> view metadata file pointer* >>>>>> >>>>>> In my sheet, there are 3 options that do not have major problems: >>>>>> Option 2: Add storage table metadata file pointer in view object >>>>>> Option 5: New MV object with table and view metadata file pointer >>>>>> Option 6: New MV spec with table and view metadata >>>>>> >>>>>> I originally excluded option 2 because I think it does not align with >>>>>> the REST spec, but after the other discussion thread about "Inconsistency >>>>>> between REST spec and table/view spec", I think my original concern no >>>>>> longer holds true so now I put it back. 
And based on my personal >>>>>> preference that MV is an independent object that should be separated from >>>>>> view and table, plus the fact that option 5 is probably less work than >>>>>> option 6 for implementation, that is how I come as a proponent of option >>>>>> 5 >>>>>> at this moment. >>>>>> >>>>>> >>>>>> *Regarding Ryan's evaluation framework* >>>>>> >>>>>> I think we need to reconcile this sheet with Ryan's evaluation >>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6 all >>>>>> under the same category of "A combination of a view and a table" and >>>>>> concludes that they don't have any advantage for the same set of reasons. >>>>>> But those reasons are not really convincing to me so let's talk about >>>>>> them >>>>>> in more detail. >>>>>> >>>>>> (1) You said "I don’t see a reason why a combined view and table is >>>>>> advantageous" as "this would cause unnecessary dependence between the >>>>>> view >>>>>> and table in catalogs." What dependency exactly do you mean here? And >>>>>> why >>>>>> is that unnecessary, given there has to be some sort of dependency anyway >>>>>> unless we go with option 5 or 6? >>>>>> >>>>>> (2) You said "I guess there’s an argument that you could load both >>>>>> table and view metadata locations at the same time. That hardly seems >>>>>> worth >>>>>> the trouble". I disagree with that. Catalog interaction performance is >>>>>> critical to at least everyone working in EMR and Athena, and MV itself as >>>>>> an acceleration approach needs to be as fast as possible. >>>>>> >>>>>> I have put 3 key operations in the doc that I think matters for MV >>>>>> during interactions with engine: >>>>>> 1. refreshes storage table >>>>>> 2. get the storage table of the MV >>>>>> 3. if stale, get the view SQL >>>>>> >>>>>> And option 1 clearly falls short with 4 sequential steps required to >>>>>> load a storage table. 
You mentioned "recent issues with adding views to >>>>>> the >>>>>> JDBC catalog" in this topic; could you explain a bit more? >>>>>> >>>>>> (3) You said "I also think that once we decide on structure, we can >>>>>> make it possible for REST catalog implementations to do smart things, in >>>>>> a >>>>>> way that doesn’t put additional requirements on the underlying catalog >>>>>> store." If REST is fully compatible with Iceberg spec then I have no >>>>>> problem with this statement. However, as we discussed in the other >>>>>> thread, >>>>>> it is not the case. In the current state, I think the sequence of actions >>>>>> should be to evolve the Iceberg table/view spec (or add a MV spec) first, >>>>>> and then think about how REST can incorporate it or do smart things that >>>>>> are not Iceberg spec compliant. Do you agree with that? >>>>>> >>>>>> (4) You said the table identifier pointer "is a problem we need to >>>>>> solve generally because a materialized table needs to be able to track >>>>>> the >>>>>> upstream state of tables that were used". I don't think that is a reason >>>>>> to >>>>>> choose to use a table identifier pointer for a storage table. The issue >>>>>> is >>>>>> not about using a table identifier pointer. It is about exposing the >>>>>> storage table as a separate entity in the catalog, which is what people >>>>>> do >>>>>> not like and is already discussed at length in Jan's question 3 (also >>>>>> linked in the sheet). I agree with that statement, because without a REST >>>>>> implementation that can magically hide the storage table, this model adds >>>>>> additional burden regarding compliance and data governance for any other >>>>>> non-REST catalog implementations that are compliant with the Iceberg spec. >>>>>> Many mechanisms need to be built in a catalog to hide, protect, maintain, >>>>>> and recycle the storage table, which can be avoided by using other >>>>>> approaches. 
I >>>>>> think we should reach a consensus about that and discuss further if you >>>>>> do >>>>>> not agree. >>>>>> >>>>>> Best, >>>>>> Jack Ye >>>>>> >>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <jank...@mailbox.org.invalid> >>>>>> wrote: >>>>>> >>>>>>> Hi Ryan, we actually discussed your categories in this question >>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>. >>>>>>> Where your categories correspond to the following designs: >>>>>>> >>>>>>> - Separate table and view => Design 1 >>>>>>> - Combination of view and table => Design 2 >>>>>>> - A new metadata type => Design 4 >>>>>>> >>>>>>> Jan >>>>>>> On 01.03.24 00:03, Ryan Blue wrote: >>>>>>> >>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so >>>>>>> I’ll be more specific: >>>>>>> >>>>>>> - *Separate table and view*: this option is to have the objects >>>>>>> that we have today, with extra metadata. Commit processes are >>>>>>> separate: >>>>>>> committing to the table doesn’t alter the view and committing to the >>>>>>> view >>>>>>> doesn’t change the table. However, changing the view can make it so >>>>>>> the >>>>>>> table is no longer useful as a materialization. >>>>>>> - *A combination of a view and a table*: in this option, the >>>>>>> table metadata and view metadata are the same as the first option. >>>>>>> The >>>>>>> difference is that the commit process combines them, either by >>>>>>> embedding a >>>>>>> table metadata location in view metadata or by tracking both in the >>>>>>> same >>>>>>> catalog reference. >>>>>>> - *A new metadata type*: this option is where we define a new >>>>>>> metadata object that has view attributes, like SQL representations, >>>>>>> along >>>>>>> with table attributes, like partition specs and snapshots. >>>>>>> >>>>>>> Hopefully this is clear because I think much of the confusion is >>>>>>> caused by different definitions. 
>>>>>>> >>>>>>> The LoadTableResponse having optional metadata-location field >>>>>>> implies that the object in the catalog no longer needs to hold a >>>>>>> metadata >>>>>>> file pointer >>>>>>> >>>>>>> The REST protocol has not removed the requirement for a metadata >>>>>>> file, so I’m going to keep focused on the MV design options. >>>>>>> >>>>>>> When we say a MV can be a “new metadata type”, it does not mean it >>>>>>> needs to define a completely brand new structure of the metadata content >>>>>>> >>>>>>> I’m making a distinction between separate metadata files for the >>>>>>> table and the view and a combined metadata object, as above. >>>>>>> >>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which >>>>>>> has 1 table metadata file pointer, and 1 view metadata file pointer >>>>>>> >>>>>>> This is the option I am referring to as a “combination of a view and >>>>>>> a table”. >>>>>>> >>>>>>> So to review my initial email, I don’t see a reason why a combined >>>>>>> view and table is advantageous, either implemented by having a catalog >>>>>>> reference with two metadata locations or embedding a table metadata >>>>>>> location in view metadata. This would cause unnecessary dependence >>>>>>> between >>>>>>> the view and table in catalogs. I guess there’s an argument that you >>>>>>> could >>>>>>> load both table and view metadata locations at the same time. That >>>>>>> hardly >>>>>>> seems worth the trouble given the recent issues with adding views to the >>>>>>> JDBC catalog. >>>>>>> >>>>>>> I also think that once we decide on structure, we can make it >>>>>>> possible for REST catalog implementations to do smart things, in a way >>>>>>> that >>>>>>> doesn’t put additional requirements on the underlying catalog store. For >>>>>>> instance, we could specify how to send additional objects in a >>>>>>> LoadViewResult, in case the catalog wants to pre-fetch table metadata. 
I >>>>>>> think these optimizations are a later addition, after we define the >>>>>>> relationship between views and tables. >>>>>>> >>>>>>> Jack, it sounds like you’re the proponent of a combined table and >>>>>>> view (rather than a new metadata spec for a materialized view). What is >>>>>>> the >>>>>>> main motivation? It seems like you’re convinced of that approach, but I >>>>>>> don’t understand the advantage it brings. >>>>>>> >>>>>>> Ryan >>>>>>> >>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi >>>>>>>> >>>>>>>> Yes I mostly agree with the assessment. To clarify a few minor >>>>>>>> points. >>>>>>>> >>>>>>>> is a materialized view a view and a separate table, a combination >>>>>>>>> of the two (i.e. commits are combined), or a new metadata type? >>>>>>>> >>>>>>>> >>>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal >>>>>>>> of a new Catalog MV object that has two references (ViewMetadata + >>>>>>>> TableMetadata). >>>>>>>> >>>>>>>> The arguments that I see for a combined materialized view object >>>>>>>>> are: >>>>>>>>> >>>>>>>>> - Regular views are separate, rather than being tables with >>>>>>>>> SQL and no data so it would be inconsistent (“Iceberg view is just >>>>>>>>> a table >>>>>>>>> with no data but with representations defined. But we did not do >>>>>>>>> that.”) >>>>>>>>> >>>>>>>>> >>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>> >>>>>>>>> >>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>> materialized views >>>>>>>>> >>>>>>>>> >>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>> isn’t required by the separate view and table option >>>>>>>>> >>>>>>>>> For completeness, there seem to be a few additional ones >>>>>>>> (mentioned in the Slack and above messages). >>>>>>>> >>>>>>>> - Lack of spec change (to ViewMetadata). 
But as Jack says it >>>>>>>> is a spec change (i.e., to catalogs) >>>>>>>> - A single call to get the View's StorageTable (versus two >>>>>>>> calls) >>>>>>>> - A more natural API, no opportunity for the user to call >>>>>>>> Catalog.dropTable() and renameTable() on storage table >>>>>>>> >>>>>>>> >>>>>>>> *Thoughts:* I think the long discussion sessions we had on Slack >>>>>>>> were fruitful for me, as seeing the API clarified some things. >>>>>>>> >>>>>>>> I was initially more in favor of MV being a new metadata type >>>>>>>> (TableMetadata + ViewMetadata). But seeing most of the MV operations >>>>>>>> end >>>>>>>> up being ViewCatalog or Catalog operations, I am starting to think >>>>>>>> API-wise >>>>>>>> that it may not align with the new metadata type (unless we define >>>>>>>> MVCatalog and /MV REST endpoints, which then are boilerplate wrappers). >>>>>>>> >>>>>>>> Initially one question I had for option 'a view and a separate >>>>>>>> table' was how to make this table reference (metadata.json or catalog >>>>>>>> reference). In the previous option, we had a precedent of Catalog >>>>>>>> references to Metadata, but not pointers between Metadatas. I >>>>>>>> initially >>>>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting' >>>>>>>> catalog >>>>>>>> concerns in ViewMetadata. (I saw Catalog and ViewCatalog as a layer >>>>>>>> above >>>>>>>> TableMetadata and ViewMetadata). But I think Dan on Slack made a >>>>>>>> fair >>>>>>>> point that ViewMetadata already is tightly bound with a Catalog. In >>>>>>>> this >>>>>>>> case, I think this approach does have its merits as well in aligning >>>>>>>> Catalog APIs with the metadata. 
>>>>>>>> >>>>>>>> Thanks >>>>>>>> Szehon >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> I would like to provide my perspective on the question of what a >>>>>>>>> materialized view is and elaborate on Jack's recent proposal to view a >>>>>>>>> materialized view as a catalog concept. >>>>>>>>> >>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in >>>>>>>>> the catalog has a *unique identifier*, and the catalog provides >>>>>>>>> methods to create, load, and update these entities. An important >>>>>>>>> thing to >>>>>>>>> note is that the catalog methods exhibit two different behaviors: the >>>>>>>>> *create >>>>>>>>> and load methods deal with the entire entity*, while the >>>>>>>>> *update(commit) >>>>>>>>> method only deals with partial changes* to the entities. >>>>>>>>> >>>>>>>>> In the context of our current discussion, materialized view (MV) >>>>>>>>> metadata is a union of view and table metadata. The fact that the >>>>>>>>> update >>>>>>>>> method deals only with partial changes enables us to *reuse the >>>>>>>>> existing methods for updating tables and views*. For updates we >>>>>>>>> don't have to define what constitutes an entire materialized view. >>>>>>>>> Changes >>>>>>>>> to a materialized view targeting the properties related to the view >>>>>>>>> metadata could use the update(commit) view method. Similarly, changes >>>>>>>>> targeting the properties related to the table metadata could use the >>>>>>>>> update(commit) table method. This is great news because we don't have >>>>>>>>> to >>>>>>>>> redefine view and table commits (requirements, updates). 
>>>>>>>>> This is shown in the fact that Jack uses the same operation to >>>>>>>>> update the storage table for Option 1 and 3: >>>>>>>>>
>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>> >>>>>>>>> The open question is *whether the create and load methods should >>>>>>>>> treat the properties that constitute the MV metadata as two entities >>>>>>>>> (View >>>>>>>>> + Table) or one entity (new MV object)*. This is all part of >>>>>>>>> Jack's proposal, where Option 1 proposes a new MV object, and Option 3 >>>>>>>>> proposes two separate entities. The advantage of Option 1 is that it >>>>>>>>> doesn't require two operations to load the metadata. On the other >>>>>>>>> hand, the >>>>>>>>> advantage of Option 3 is that no new operations or catalogs have to be >>>>>>>>> defined. >>>>>>>>> >>>>>>>>> In my opinion, defining a new representation for materialized >>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see a >>>>>>>>> path >>>>>>>>> where we could first introduce Option 3 and still have the >>>>>>>>> possibility to >>>>>>>>> transition to Option 1 if needed. The great thing about Option 3 is >>>>>>>>> that it >>>>>>>>> only requires minor changes to the current spec and is mostly >>>>>>>>> implementation detail. >>>>>>>>> >>>>>>>>> Therefore I would propose small additions to Jack's Option 3 that >>>>>>>>> only introduce changes to the spec that are not specific to >>>>>>>>> materialized >>>>>>>>> views. The idea is to introduce boolean properties to be set on the >>>>>>>>> creation of the view and the storage table that indicate that they >>>>>>>>> belong >>>>>>>>> to a materialized view. The view property "materialized" is set to >>>>>>>>> "true" >>>>>>>>> for a MV and "false" for a regular view. 
And the table property >>>>>>>>> "storage_table" is set to "true" for a storage table and "false" for a >>>>>>>>> regular table. The absence of these properties indicates a regular >>>>>>>>> view or >>>>>>>>> table. >>>>>>>>>
>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>
>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>
>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>
>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>> >>>>>>>>> We could then introduce a new requirement for views and tables >>>>>>>>> called "AssertProperty" which could make sure to only perform updates >>>>>>>>> that >>>>>>>>> are in line with materialized views. The additional requirement can be >>>>>>>>> seen >>>>>>>>> as a general extension which does not need to be changed if we decide >>>>>>>>> to >>>>>>>>> go with Option 1 in the future. >>>>>>>>> >>>>>>>>> Let me know what you think. >>>>>>>>> >>>>>>>>> Best wishes, >>>>>>>>> >>>>>>>>> Jan >>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing >>>>>>>>> metadata definitions and minimizing spec changes are very important. >>>>>>>>> This >>>>>>>>> also minimizes spec drift (between materialized views and views spec, >>>>>>>>> and >>>>>>>>> between materialized views and tables spec), and simplifies the >>>>>>>>> implementation. 
>>>>>>>>> >>>>>>>>> In an effort to take the discussion forward with concrete design >>>>>>>>> options based on an end-to-end implementation, I have prototyped the >>>>>>>>> implementation (and added Spark support) in this PR >>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us >>>>>>>>> reach convergence faster. More details about some of the design >>>>>>>>> options are >>>>>>>>> discussed in the description of the PR. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Walaa. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>>>> >>>>>>>>>> I mean separate table and view metadata that is somehow combined >>>>>>>>>> through a commit process. For instance, keeping a pointer to a table >>>>>>>>>> metadata file in a view metadata file or combining commits to >>>>>>>>>> reference >>>>>>>>>> both. I don't see the value in either option. >>>>>>>>>> >>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks Ryan for the help to trace back to the root question! >>>>>>>>>>> Just a clarification question regarding your reply before I reply >>>>>>>>>>> further: >>>>>>>>>>> what exactly does the option "a combination of the two (i.e. >>>>>>>>>>> commits are >>>>>>>>>>> combined)" mean? How is that different from "a new metadata type"? >>>>>>>>>>> >>>>>>>>>>> -Jack >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring >>>>>>>>>>>> a fresh perspective. >>>>>>>>>>>> >>>>>>>>>>>> Jack already pointed out that we need to start from the basics >>>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right >>>>>>>>>>>> now is the >>>>>>>>>>>> time for discussing trade-offs, not lining up and taking sides. 
I >>>>>>>>>>>> realize >>>>>>>>>>>> that wasn’t the intent with adding a vote, but that’s almost >>>>>>>>>>>> always the >>>>>>>>>>>> result. It’s too easy to use it as a stand-in for consensus and >>>>>>>>>>>> move on >>>>>>>>>>>> prematurely. I get the impression from the swirl in Slack that >>>>>>>>>>>> discussion >>>>>>>>>>>> has moved ahead of agreement. >>>>>>>>>>>> >>>>>>>>>>>> We’re still at the most basic question: is a materialized view >>>>>>>>>>>> a view and a separate table, a combination of the two (i.e. >>>>>>>>>>>> commits are >>>>>>>>>>>> combined), or a new metadata type? >>>>>>>>>>>> >>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind >>>>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the >>>>>>>>>>>> catalog. >>>>>>>>>>>> That’s a later choice (already pointed out) and, I suspect, it >>>>>>>>>>>> should be >>>>>>>>>>>> delegated to catalog implementations. >>>>>>>>>>>> >>>>>>>>>>>> To simplify this a little, I think that we can eliminate the >>>>>>>>>>>> option to combine table and view commits. I don’t think there is a >>>>>>>>>>>> reason >>>>>>>>>>>> to combine the two. If separate, a table would track the view >>>>>>>>>>>> version used >>>>>>>>>>>> along with freshness information for referenced tables. If the >>>>>>>>>>>> table is >>>>>>>>>>>> automatically skipped when the version no longer matches the view, >>>>>>>>>>>> then no >>>>>>>>>>>> action needs to happen when a view definition changes. Similarly, >>>>>>>>>>>> the table >>>>>>>>>>>> can be updated independently without needing to also swap view >>>>>>>>>>>> metadata. >>>>>>>>>>>> This also aligns with the idea from the original doc that there >>>>>>>>>>>> can be >>>>>>>>>>>> multiple materialization tables for a view. 
Each should operate >>>>>>>>>>>> independently unless I’m missing something >>>>>>>>>>>> >>>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious so >>>>>>>>>>>> I’ll move on, but please stop here and reply if you disagree! >>>>>>>>>>>> >>>>>>>>>>>> That leaves the main two options, a view and a separate table >>>>>>>>>>>> linked by metadata, or, combined materialized view metadata. >>>>>>>>>>>> >>>>>>>>>>>> As the doc notes, the separate view and table option is simpler >>>>>>>>>>>> because it reuses existing metadata definitions and falls back to >>>>>>>>>>>> simple >>>>>>>>>>>> views. That is a significantly smaller spec and small is very, very >>>>>>>>>>>> important when it comes to specs. I think that the argument for a >>>>>>>>>>>> new >>>>>>>>>>>> definition of a materialized view needs to overcome this >>>>>>>>>>>> disadvantage. >>>>>>>>>>>> >>>>>>>>>>>> The arguments that I see for a combined materialized view >>>>>>>>>>>> object are: >>>>>>>>>>>> >>>>>>>>>>>> - Regular views are separate, rather than being tables with >>>>>>>>>>>> SQL and no data so it would be inconsistent (“Iceberg view is >>>>>>>>>>>> just a table >>>>>>>>>>>> with no data but with representations defined. But we did not >>>>>>>>>>>> do that.”) >>>>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>>>>> materialized views >>>>>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>>>>> isn’t required by the separate view and table option >>>>>>>>>>>> >>>>>>>>>>>> Am I missing any arguments for combined metadata? >>>>>>>>>>>> >>>>>>>>>>>> Ryan >>>>>>>>>>>> -- >>>>>>>>>>>> Ryan Blue >>>>>>>>>>>> Tabular >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Ryan Blue >>>>>>>>>> Tabular >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ryan Blue >>>>>>> Tabular >>>>>>> >>>>>>> >> >> -- >> Ryan Blue >> Tabular >> >
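For readers following the thread, the freshness rule described in Walaa's message (the storage table records base.snapshot.[UUID] properties, and the materialized view is fresh when every recorded snapshot ID matches the base table's current snapshot ID) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses plain Java maps rather than any Iceberg API, the class and method names (MvFreshnessSketch, isFresh) are hypothetical, and treating a storage table with no base.snapshot.* properties as stale is an assumption that the thread does not settle.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of the Iceberg library.
class MvFreshnessSketch {
    // Property prefix as named in the thread: base.snapshot.[UUID]
    static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

    // storageTableProperties: table properties of the storage table.
    // currentSnapshotIdByTableUuid: current snapshot ID of each base table,
    // keyed by base table UUID (how these are looked up is out of scope here).
    public static boolean isFresh(Map<String, String> storageTableProperties,
                                  Map<String, Long> currentSnapshotIdByTableUuid) {
        boolean sawBaseTable = false;
        for (Map.Entry<String, String> property : storageTableProperties.entrySet()) {
            if (!property.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
                continue; // unrelated table property
            }
            sawBaseTable = true;
            String baseTableUuid = property.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
            Long current = currentSnapshotIdByTableUuid.get(baseTableUuid);
            // Stale if the base table is unknown or has moved to another snapshot.
            if (current == null || !String.valueOf(current).equals(property.getValue())) {
                return false;
            }
        }
        // Assumption: with no recorded base snapshots, report "not fresh".
        return sawBaseTable;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("base.snapshot.uuid-a", "100");
        props.put("base.snapshot.uuid-b", "200");

        Map<String, Long> current = new HashMap<>();
        current.put("uuid-a", 100L);
        current.put("uuid-b", 200L);
        System.out.println(isFresh(props, current)); // all recorded IDs match

        current.put("uuid-b", 201L); // base table uuid-b received a new snapshot
        System.out.println(isFresh(props, current));
    }
}
```

Note that the check itself is cheap; as Dan points out in the thread, the expensive part is likely loading the current state of every base table to build the second map.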