Hi, sorry I couldn't join the last community sync. Did we reach any conclusion about the MV spec?
On Tue, Mar 5, 2024 at 11:28 PM himadri pal <meh...@gmail.com> wrote: > For me the calendar link did not work in mobile, but I was able to add the > dev Google calendar from > https://iceberg.apache.org/community/#iceberg-community-events by > accessing it from laptop. > > Regards, > Himadri Pal > > > On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> > wrote: > >> Thanks Jack! I think the images are stripped from the message, but they >> are there on the doc >> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0> >> if >> someone wants to check them out (I have left some comments while there). >> >> Also I no longer see the community sync calendar >> https://iceberg.apache.org/community/#slack, so it is unclear when the >> meeting is (and we do not have the link). >> >> Thanks, >> Walaa. >> >> >> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <yezhao...@gmail.com> wrote: >> >>> Thanks Jan! +1 for everyone to take a look before the discussion, and >>> see if there are any missing options or major arguments. >>> >>> I have also added the images regarding all the options, it might be >>> easier to parse than the big sheet. 
>>> I will also put it here for people that do not have time to read
>>> through it:
>>>
>>> *Option 1: Add storage table identifier in view metadata content*
>>>
>>> [image: MV option 1.png]
>>>
>>> *Option 2: Add storage table metadata file pointer in view object*
>>>
>>> [image: MV option 2.png]
>>>
>>> *Option 3: Add storage table metadata file pointer in view metadata content*
>>>
>>> [image: MV option 3.png]
>>>
>>> *Option 4: Embed table metadata in view metadata content*
>>>
>>> [image: MV option 4.png]
>>>
>>> *Option 5: New MV spec, MV object has table and view metadata file pointers*
>>>
>>> [image: MV option 5.png]
>>>
>>> *Option 6: New MV spec, MV metadata content embeds table and view metadata*
>>>
>>> [image: MV option 6.png]
>>>
>>> *Option 7: New MV spec, completely new MV metadata content*
>>>
>>> [image: MV option 7.png]
>>>
>>> -Jack
>>>
>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>
>>>> I think it's great to have a face-to-face discussion about this.
>>>> Additionally, I would propose to use Jack's document
>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>> as common ground for the discussion, and that everyone has a quick
>>>> look before the next community sync. If you think the document is
>>>> still missing some arguments, please make suggestions to add them.
>>>> This way we spend less time getting everyone up to speed and share a
>>>> more common terminology.
>>>>
>>>> Looking forward to the discussion, best wishes
>>>>
>>>> Jan
>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>
>>>> The calendar on the site is currently broken
>>>> https://iceberg.apache.org/community/#iceberg-community-events. Might
>>>> help to fix it or share the meeting link here.
>>>>
>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Sounds good, let's discuss this in person!
>>>>>
>>>>> I am a bit worried that we have quite a few critical topics going on
>>>>> right now on the dev list, and this will take up a lot of time to
>>>>> discuss. If it ends up going for too long, I propose that we have a
>>>>> dedicated meeting, and I am more than happy to organize it.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I think this thread has hit a point of diminishing returns and that
>>>>>> we still don't have a common understanding of what the options under
>>>>>> consideration actually are.
>>>>>>
>>>>>> Since we were already planning on discussing this at the next
>>>>>> community sync, I suggest we pick this up there and use that time to
>>>>>> align on what exactly we're considering. We can then start a new
>>>>>> thread to lay out the designs under consideration in more detail and
>>>>>> then have a discussion about trade-offs.
>>>>>>
>>>>>> Does that sound reasonable?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>>> also suggest breaking the expectation/outcome into milestones. Maybe
>>>>>>> it becomes easier if we agree to distinguish between an approach
>>>>>>> that is feasible in the near term and another in the long term,
>>>>>>> especially if the latter requires significant engine-side changes.
>>>>>>>
>>>>>>> Further, maybe it helps if we start with an option that fully reuses
>>>>>>> the existing spec, and see how we view it in comparison with the
>>>>>>> options discussed previously. I am sharing one below. It reuses the
>>>>>>> current spec of Iceberg views and tables by leveraging table
>>>>>>> properties to capture materialized view metadata. What is common
>>>>>>> (and not common) between this and the desired representations?
>>>>>>>
>>>>>>> The new properties are:
>>>>>>>
>>>>>>> Properties on a View:
>>>>>>>
>>>>>>> 1. *iceberg.materialized.view*:
>>>>>>>    - *Type*: View property
>>>>>>>    - *Purpose*: This property is used to mark whether a view is a
>>>>>>>      materialized view. If set to true, the view is treated as a
>>>>>>>      materialized view. This helps in differentiating between
>>>>>>>      virtual and materialized views within the catalog and dictates
>>>>>>>      specific handling and validation logic for materialized views.
>>>>>>> 2. *iceberg.materialized.view.storage.location*:
>>>>>>>    - *Type*: View property
>>>>>>>    - *Purpose*: Specifies the location of the storage table
>>>>>>>      associated with the materialized view. This property is used
>>>>>>>      for linking a materialized view with its corresponding storage
>>>>>>>      table, enabling data management and query execution based on
>>>>>>>      the stored data freshness.
>>>>>>>
>>>>>>> Properties on a Table:
>>>>>>>
>>>>>>> 1. *base.snapshot.[UUID]*:
>>>>>>>    - *Type*: Table property
>>>>>>>    - *Purpose*: These properties store the snapshot IDs of the base
>>>>>>>      tables at the time the materialized view's data was last
>>>>>>>      updated. Each property is prefixed with base.snapshot. followed
>>>>>>>      by the UUID of the base table. They are used to track whether
>>>>>>>      the materialized view's data is up to date with the base tables
>>>>>>>      by comparing these snapshot IDs with the current snapshot IDs
>>>>>>>      of the base tables. If all the base tables' current snapshot
>>>>>>>      IDs match the ones stored in these properties, the materialized
>>>>>>>      view's data is considered fresh.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > All of these approaches are aligned in one, specific way: the storage table is an iceberg table.
>>>>>>>> >>>>>>>> I do not think that is true. I think people are aligned that we >>>>>>>> would like to re-use the Iceberg table metadata defined in the Iceberg >>>>>>>> table spec to express the data in MV, but I don't think it goes that >>>>>>>> far to >>>>>>>> say it must be an Iceberg table. Once you have that mindset, then of >>>>>>>> course >>>>>>>> option 1 (separate table and view) is the only option. >>>>>>>> >>>>>>>> > I don't think that is necessary and it significantly increases >>>>>>>> the complexity. >>>>>>>> >>>>>>>> And can you quantify what you mean by "significantly increases the >>>>>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff >>>>>>>> with >>>>>>>> complexity. We probably all agree that using option 7 (a completely new >>>>>>>> metadata type) is a lot of work from scratch, that is why it is not >>>>>>>> favored. However, my understanding is that as long as we re-use the >>>>>>>> view >>>>>>>> and table metadata, then the majority of the existing logic can be >>>>>>>> reused. >>>>>>>> I think what we have gone through in Slack to draft the rough Java API >>>>>>>> shape helps here, because people can estimate the amount of effort >>>>>>>> required >>>>>>>> to implement it. And I don't think they are **significantly** more >>>>>>>> complex >>>>>>>> to implement. Could you elaborate more about the complexity that you >>>>>>>> imagine? >>>>>>>> >>>>>>>> -Jack >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks < >>>>>>>> daniel.c.we...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I feel I've been most vocal about pushing back against options 2+ >>>>>>>>> (or Ryan's categories of combined table/view, or new metadata type), >>>>>>>>> so >>>>>>>>> I'll try to expand on my reasoning. 
>>>>>>>>> >>>>>>>>> I understand the appeal of creating a design where we encapsulate >>>>>>>>> the view/storage from both a structural and performance standpoint, >>>>>>>>> but I >>>>>>>>> don't think that is necessary and it significantly increases the >>>>>>>>> complexity. >>>>>>>>> >>>>>>>>> All of these approaches are aligned in one, specific way: the >>>>>>>>> storage table is an iceberg table. >>>>>>>>> >>>>>>>>> Because of this, all the behaviors and requirements still apply to >>>>>>>>> these tables. They need to be maintained (snapshot cleanup, orphan >>>>>>>>> files), >>>>>>>>> in cases need to be optimized (compaction, manifest rewrites), they >>>>>>>>> need to >>>>>>>>> be able to be inspected (this will be even more important with MV >>>>>>>>> since >>>>>>>>> staleness can produce different results and questions will arise >>>>>>>>> about what >>>>>>>>> state the storage table was in). There may be cases where the tables >>>>>>>>> need >>>>>>>>> to be managed directly. >>>>>>>>> >>>>>>>>> Anywhere we deviate from the existing constructs/commit/access for >>>>>>>>> tables, we will ultimately have to then unwrap to re-expose the >>>>>>>>> underlying >>>>>>>>> Iceberg behavior. This creates unnecessary complexity in the >>>>>>>>> library/API >>>>>>>>> layer, which are not the primary interface users will have with >>>>>>>>> materialized views where an engine is almost entirely necessary to >>>>>>>>> interact >>>>>>>>> with the dataset. >>>>>>>>> >>>>>>>>> As to the performance concerns around option 1, I think we're >>>>>>>>> overstating the downsides. It really comes down to how many metadata >>>>>>>>> loads >>>>>>>>> are necessary and evaluating freshness would likely be the real >>>>>>>>> bottleneck >>>>>>>>> as it involves potentially loading many tables. All of the options >>>>>>>>> are on >>>>>>>>> the same order of performance for the metadata and table loads. 
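Since freshness evaluation is named here as the likely bottleneck, it may help to see why it scales with the number of base tables. The following is a minimal sketch, assuming the `base.snapshot.[UUID]` table properties from Walaa's proposal quoted earlier in this thread. `FreshnessSketch` and `isFresh` are hypothetical names, not part of the Iceberg API, and the map of current snapshot IDs stands in for whatever an engine would obtain by loading each base table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Iceberg API.
final class FreshnessSketch {
    // Property prefix taken from the proposal quoted earlier in the thread.
    static final String PREFIX = "base.snapshot.";

    /**
     * storageTableProps: properties of the storage table.
     * currentSnapshotIds: base table UUID -> current snapshot ID. An engine
     * would have to load every base table to build this map, which is the
     * potentially expensive part.
     */
    public static boolean isFresh(Map<String, String> storageTableProps,
                                  Map<String, Long> currentSnapshotIds) {
        boolean sawBaseTable = false;
        for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
            if (!entry.getKey().startsWith(PREFIX)) {
                continue; // unrelated table property
            }
            sawBaseTable = true;
            String baseTableUuid = entry.getKey().substring(PREFIX.length());
            Long current = currentSnapshotIds.get(baseTableUuid);
            if (current == null || current != Long.parseLong(entry.getValue())) {
                return false; // base table missing or moved to a new snapshot
            }
        }
        // Assumption of this sketch: no recorded base tables means "stale".
        return sawBaseTable;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<String, String>();
        props.put("base.snapshot.uuid-a", "101"); // "uuid-a" is a placeholder UUID
        Map<String, Long> current = new HashMap<String, Long>();
        current.put("uuid-a", 101L);
        System.out.println(isFresh(props, current));
    }
}
```

The comparison itself is cheap; the cost sits in producing `currentSnapshotIds`, which needs one metadata load per base table regardless of which MV option is chosen.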
>>>>>>>>> >>>>>>>>> As to the visibility of tables and whether they're registered in >>>>>>>>> the catalog, I think registering in the catalog is the right approach >>>>>>>>> so >>>>>>>>> that the tables are still addressable for maintenance/etc. The >>>>>>>>> visibility >>>>>>>>> of the storage table is a catalog implementation decision and >>>>>>>>> shouldn't be >>>>>>>>> a requirement of the MV spec (I can see cases for both and it isn't >>>>>>>>> necessary to dictate a behavior). >>>>>>>>> >>>>>>>>> I'm still strongly in favor of Option 1 (separate table and view) >>>>>>>>> for these reasons. >>>>>>>>> >>>>>>>>> -Dan >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined table >>>>>>>>>> and view (rather than a new metadata spec for a materialized view). >>>>>>>>>> What is >>>>>>>>>> the main motivation? It seems like you’re convinced of that >>>>>>>>>> approach, but I >>>>>>>>>> don’t understand the advantage it brings. >>>>>>>>>> >>>>>>>>>> Sorry I have to make a Google Sheet to capture all the options we >>>>>>>>>> have discussed so far, I wanted to use the existing Google Doc, but >>>>>>>>>> it has >>>>>>>>>> really bad table/sheet support... >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>>>>>>>> >>>>>>>>>> I have listed all the options, with how they are implemented and >>>>>>>>>> some important considerations we have discussed so far. Note that: >>>>>>>>>> 1. This sheet currently excludes the lineage information, which >>>>>>>>>> we can discuss more later after the current topic is resolved. >>>>>>>>>> 2. I removed the considerations for REST integration since from >>>>>>>>>> the other thread we have clarified that they should be considered >>>>>>>>>> completely separately. 
>>>>>>>>>> >>>>>>>>>> *Why I come as a proponent of having a new MV object with table >>>>>>>>>> and view metadata file pointer* >>>>>>>>>> >>>>>>>>>> In my sheet, there are 3 options that do not have major problems: >>>>>>>>>> Option 2: Add storage table metadata file pointer in view object >>>>>>>>>> Option 5: New MV object with table and view metadata file pointer >>>>>>>>>> Option 6: New MV spec with table and view metadata >>>>>>>>>> >>>>>>>>>> I originally excluded option 2 because I think it does not align >>>>>>>>>> with the REST spec, but after the other discussion thread about >>>>>>>>>> "Inconsistency >>>>>>>>>> between REST spec and table/view spec", I think my original concern >>>>>>>>>> no >>>>>>>>>> longer holds true so now I put it back. And based on my personal >>>>>>>>>> preference that MV is an independent object that should be separated >>>>>>>>>> from >>>>>>>>>> view and table, plus the fact that option 5 is probably less work >>>>>>>>>> than >>>>>>>>>> option 6 for implementation, that is how I come as a proponent of >>>>>>>>>> option 5 >>>>>>>>>> at this moment. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> *Regarding Ryan's evaluation framework * >>>>>>>>>> >>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation >>>>>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6 >>>>>>>>>> all >>>>>>>>>> under the same category of "A combination of a view and a table" >>>>>>>>>> and concludes that they don't have any advantage for the same set of >>>>>>>>>> reasons. But those reasons are not really convincing to me so let's >>>>>>>>>> talk >>>>>>>>>> about them in more detail. >>>>>>>>>> >>>>>>>>>> (1) You said "I don’t see a reason why a combined view and table >>>>>>>>>> is advantageous" as "this would cause unnecessary dependence between >>>>>>>>>> the >>>>>>>>>> view and table in catalogs." What dependency exactly do you mean >>>>>>>>>> here? 
And >>>>>>>>>> why is that unnecessary, given there has to be some sort of >>>>>>>>>> dependency >>>>>>>>>> anyway unless we go with option 5 or 6? >>>>>>>>>> >>>>>>>>>> (2) You said "I guess there’s an argument that you could load >>>>>>>>>> both table and view metadata locations at the same time. That hardly >>>>>>>>>> seems >>>>>>>>>> worth the trouble". I disagree with that. Catalog interaction >>>>>>>>>> performance >>>>>>>>>> is critical to at least everyone working in EMR and Athena, and MV >>>>>>>>>> itself >>>>>>>>>> as an acceleration approach needs to be as fast as possible. >>>>>>>>>> >>>>>>>>>> I have put 3 key operations in the doc that I think matters for >>>>>>>>>> MV during interactions with engine: >>>>>>>>>> 1. refreshes storage table >>>>>>>>>> 2. get the storage table of the MV >>>>>>>>>> 3. if stale, get the view SQL >>>>>>>>>> >>>>>>>>>> And option 1 clearly falls short with 4 sequential steps required >>>>>>>>>> to load a storage table. You mentioned "recent issues with adding >>>>>>>>>> views to >>>>>>>>>> the JDBC catalog" in this topic, could you explain a bit more? >>>>>>>>>> >>>>>>>>>> (3) You said "I also think that once we decide on structure, we >>>>>>>>>> can make it possible for REST catalog implementations to do smart >>>>>>>>>> things, >>>>>>>>>> in a way that doesn’t put additional requirements on the underlying >>>>>>>>>> catalog >>>>>>>>>> store." If REST is fully compatible with Iceberg spec then I have no >>>>>>>>>> problem with this statement. However, as we discussed in the other >>>>>>>>>> thread, >>>>>>>>>> it is not the case. In the current state, I think the sequence of >>>>>>>>>> action >>>>>>>>>> should be to evolve the Iceberg table/view spec (or add a MV spec) >>>>>>>>>> first, >>>>>>>>>> and then think about how REST can incorporate it or do smart things >>>>>>>>>> that >>>>>>>>>> are not Iceberg spec compliant. Do you agree with that? 
>>>>>>>>>> >>>>>>>>>> (4) You said the table identifier pointer "is a problem we need >>>>>>>>>> to solve generally because a materialized table needs to be able to >>>>>>>>>> track >>>>>>>>>> the upstream state of tables that were used". I don't think that is a >>>>>>>>>> reason to choose to use a table identifier pointer for a storage >>>>>>>>>> table. The >>>>>>>>>> issue is not about using a table identifier pointer. It is about >>>>>>>>>> exposing >>>>>>>>>> the storage table as a separate entity in the catalog, which is what >>>>>>>>>> people >>>>>>>>>> do not like and is already discussed in length in Jan's question 3 >>>>>>>>>> (also >>>>>>>>>> linked in the sheet). I agree with that statement, because without a >>>>>>>>>> REST >>>>>>>>>> implementation that can magically hide the storage table, this model >>>>>>>>>> adds >>>>>>>>>> additional burden regarding compliance and data governance for any >>>>>>>>>> other >>>>>>>>>> non-REST catalog implementations that are compliant to the Iceberg >>>>>>>>>> spec. >>>>>>>>>> Many mechanisms need to be built in a catalog to hide, protect, >>>>>>>>>> maintain, >>>>>>>>>> recycle the storage table, that can be avoided by using other >>>>>>>>>> approaches. I >>>>>>>>>> think we should reach a consensus about that and discuss further if >>>>>>>>>> you do >>>>>>>>>> not agree. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Jack Ye >>>>>>>>>> >>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul >>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Ryan, we actually discussed your categories in this question >>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>. 
>>>>>>>>>>> Where your categories correspond to the following designs: >>>>>>>>>>> >>>>>>>>>>> - Separate table and view => Design 1 >>>>>>>>>>> - Combination of view and table => Design 2 >>>>>>>>>>> - A new metadata type => Design 4 >>>>>>>>>>> >>>>>>>>>>> Jan >>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote: >>>>>>>>>>> >>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so >>>>>>>>>>> I’ll be more specific: >>>>>>>>>>> >>>>>>>>>>> - *Separate table and view*: this option is to have the >>>>>>>>>>> objects that we have today, with extra metadata. Commit >>>>>>>>>>> processes are >>>>>>>>>>> separate: committing to the table doesn’t alter the view and >>>>>>>>>>> committing to >>>>>>>>>>> the view doesn’t change the table. However, changing the view >>>>>>>>>>> can make it >>>>>>>>>>> so the table is no longer useful as a materialization. >>>>>>>>>>> - *A combination of a view and a table*: in this option, the >>>>>>>>>>> table metadata and view metadata are the same as the first >>>>>>>>>>> option. The >>>>>>>>>>> difference is that the commit process combines them, either by >>>>>>>>>>> embedding a >>>>>>>>>>> table metadata location in view metadata or by tracking both in >>>>>>>>>>> the same >>>>>>>>>>> catalog reference. >>>>>>>>>>> - *A new metadata type*: this option is where we define a >>>>>>>>>>> new metadata object that has view attributes, like SQL >>>>>>>>>>> representations, >>>>>>>>>>> along with table attributes, like partition specs and snapshots. >>>>>>>>>>> >>>>>>>>>>> Hopefully this is clear because I think much of the confusion is >>>>>>>>>>> caused by different definitions. 
>>>>>>>>>>> >>>>>>>>>>> The LoadTableResponse having optional metadata-location field >>>>>>>>>>> implies that the object in the catalog no longer needs to hold a >>>>>>>>>>> metadata >>>>>>>>>>> file pointer >>>>>>>>>>> >>>>>>>>>>> The REST protocol has not removed the requirement for a metadata >>>>>>>>>>> file, so I’m going to keep focused on the MV design options. >>>>>>>>>>> >>>>>>>>>>> When we say a MV can be a “new metadata type”, it does not mean >>>>>>>>>>> it needs to define a completely brand new structure of the metadata >>>>>>>>>>> content >>>>>>>>>>> >>>>>>>>>>> I’m making a distinction between separate metadata files for the >>>>>>>>>>> table and the view and a combined metadata object, as above. >>>>>>>>>>> >>>>>>>>>>> We can define an “Iceberg MV” to be an object in a catalog, >>>>>>>>>>> which has 1 table metadata file pointer, and 1 view metadata file >>>>>>>>>>> pointer >>>>>>>>>>> >>>>>>>>>>> This is the option I am referring to as a “combination of a view >>>>>>>>>>> and a table”. >>>>>>>>>>> >>>>>>>>>>> So to review my initial email, I don’t see a reason why a >>>>>>>>>>> combined view and table is advantageous, either implemented by >>>>>>>>>>> having a >>>>>>>>>>> catalog reference with two metadata locations or embedding a table >>>>>>>>>>> metadata >>>>>>>>>>> location in view metadata. This would cause unnecessary dependence >>>>>>>>>>> between >>>>>>>>>>> the view and table in catalogs. I guess there’s an argument that >>>>>>>>>>> you could >>>>>>>>>>> load both table and view metadata locations at the same time. That >>>>>>>>>>> hardly >>>>>>>>>>> seems worth the trouble given the recent issues with adding views >>>>>>>>>>> to the >>>>>>>>>>> JDBC catalog. >>>>>>>>>>> >>>>>>>>>>> I also think that once we decide on structure, we can make it >>>>>>>>>>> possible for REST catalog implementations to do smart things, in a >>>>>>>>>>> way that >>>>>>>>>>> doesn’t put additional requirements on the underlying catalog >>>>>>>>>>> store. 
For >>>>>>>>>>> instance, we could specify how to send additional objects in a >>>>>>>>>>> LoadViewResult, in case the catalog wants to pre-fetch table >>>>>>>>>>> metadata. I >>>>>>>>>>> think these optimizations are a later addition, after we define the >>>>>>>>>>> relationship between views and tables. >>>>>>>>>>> >>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table >>>>>>>>>>> and view (rather than a new metadata spec for a materialized view). >>>>>>>>>>> What is >>>>>>>>>>> the main motivation? It seems like you’re convinced of that >>>>>>>>>>> approach, but I >>>>>>>>>>> don’t understand the advantage it brings. >>>>>>>>>>> >>>>>>>>>>> Ryan >>>>>>>>>>> >>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho < >>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi >>>>>>>>>>>> >>>>>>>>>>>> Yes I mostly agree with the assessment. To clarify a few minor >>>>>>>>>>>> points. >>>>>>>>>>>> >>>>>>>>>>>> is a materialized view a view and a separate table, a >>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new >>>>>>>>>>>>> metadata type? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial >>>>>>>>>>>> proposal of a new Catalog MV object that has two references >>>>>>>>>>>> (ViewMetadata + >>>>>>>>>>>> TableMetadata). >>>>>>>>>>>> >>>>>>>>>>>> The arguments that I see for a combined materialized view >>>>>>>>>>>>> object are: >>>>>>>>>>>>> >>>>>>>>>>>>> - Regular views are separate, rather than being tables >>>>>>>>>>>>> with SQL and no data so it would be inconsistent (“Iceberg >>>>>>>>>>>>> view is just a >>>>>>>>>>>>> table with no data but with representations defined. 
But we >>>>>>>>>>>>> did not do >>>>>>>>>>>>> that.”) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>>>>>> materialized views >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>>>>>> isn’t required by the separate view and table option >>>>>>>>>>>>> >>>>>>>>>>>>> For completeness, there seem to be a few additional ones >>>>>>>>>>>> (mentioned in the Slack and above messages). >>>>>>>>>>>> >>>>>>>>>>>> - Lack of spec change (to ViewMetadata). But as Jack says >>>>>>>>>>>> it is a spec change (ie, to catalogs) >>>>>>>>>>>> - A single call to get the View's StorageTable (versus two >>>>>>>>>>>> calls) >>>>>>>>>>>> - A more natural API, no opportunity for user to call >>>>>>>>>>>> Catalog.dropTable() and renameTable() on storage table >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> *Thoughts: *I think the long discussion sessions we had on >>>>>>>>>>>> Slack was fruitful for me, as seeing the API clarified some things. >>>>>>>>>>>> >>>>>>>>>>>> I was initially more in favor of MV being a new metadata type >>>>>>>>>>>> (TableMetadata + ViewMetadata). But seeing most of the MV >>>>>>>>>>>> operations end >>>>>>>>>>>> up being ViewCatalog or Catalog operations, I am starting to think >>>>>>>>>>>> API-wise >>>>>>>>>>>> that it may not align with the new metadata type (unless we define >>>>>>>>>>>> MVCatalog and /MV REST endpoints, which then are boilerplate >>>>>>>>>>>> wrappers). >>>>>>>>>>>> >>>>>>>>>>>> Initially one question I had for option 'a view and a separate >>>>>>>>>>>> table', was how to make this table reference (metadata.json or >>>>>>>>>>>> catalog >>>>>>>>>>>> reference). In the previous option, we had a precedent of Catalog >>>>>>>>>>>> references to Metadata, but not pointers between Metadatas. 
I >>>>>>>>>>>> initially >>>>>>>>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting' >>>>>>>>>>>> catalog >>>>>>>>>>>> concerns in ViewMetadata. (I saw Catalog and ViewCatalog as a >>>>>>>>>>>> layer above >>>>>>>>>>>> TableMetadata and ViewMetadata). But I think Dan in the Slack >>>>>>>>>>>> made a fair >>>>>>>>>>>> point that ViewMetadata already is tightly bound with a Catalog. >>>>>>>>>>>> In this >>>>>>>>>>>> case, I think this approach does have its merits as well in >>>>>>>>>>>> aligning >>>>>>>>>>>> Catalog API's with the metadata. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Szehon >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi all, >>>>>>>>>>>>> >>>>>>>>>>>>> I would like to provide my perspective on the question of what >>>>>>>>>>>>> a materialized view is and elaborate on Jack's recent proposal to >>>>>>>>>>>>> view a >>>>>>>>>>>>> materialized view as a catalog concept. >>>>>>>>>>>>> >>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity >>>>>>>>>>>>> in the catalog has a *unique identifier*, and the catalog >>>>>>>>>>>>> provides methods to create, load, and update these entities. An >>>>>>>>>>>>> important >>>>>>>>>>>>> thing to note is that the catalog methods exhibit two different >>>>>>>>>>>>> behaviors: >>>>>>>>>>>>> the *create and load methods deal with the entire entity*, >>>>>>>>>>>>> while the *update(commit) method only deals with partial >>>>>>>>>>>>> changes* to the entities. >>>>>>>>>>>>> >>>>>>>>>>>>> In the context of our current discussion, materialized view >>>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact >>>>>>>>>>>>> that the >>>>>>>>>>>>> update method deals only with partial changes, enables us to >>>>>>>>>>>>> *reuse >>>>>>>>>>>>> the existing methods for updating tables and views*. 
>>>>>>>>>>>>> For updates we don't have to define what constitutes an entire
>>>>>>>>>>>>> materialized view. Changes to a materialized view targeting the
>>>>>>>>>>>>> properties related to the view metadata could use the
>>>>>>>>>>>>> update(commit) view method. Similarly, changes targeting the
>>>>>>>>>>>>> properties related to the table metadata could use the
>>>>>>>>>>>>> update(commit) table method. This is great news because we don't
>>>>>>>>>>>>> have to redefine view and table commits (requirements, updates).
>>>>>>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>>>>>>> update the storage table for Option 1 and 3:
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>
>>>>>>>>>>>>> The open question is *whether the create and load methods should
>>>>>>>>>>>>> treat the properties that constitute the MV metadata as two
>>>>>>>>>>>>> entities (View + Table) or one entity (new MV object)*. This is
>>>>>>>>>>>>> all part of Jack's proposal, where Option 1 proposes a new MV
>>>>>>>>>>>>> object, and Option 3 proposes two separate entities. The
>>>>>>>>>>>>> advantage of Option 1 is that it doesn't require two operations
>>>>>>>>>>>>> to load the metadata. On the other hand, the advantage of
>>>>>>>>>>>>> Option 3 is that no new operations or catalogs have to be
>>>>>>>>>>>>> defined.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I
>>>>>>>>>>>>> see a path where we could first introduce Option 3 and still
>>>>>>>>>>>>> have the possibility to transition to Option 1 if needed.
>>>>>>>>>>>>> The great thing about Option 3 is that it only requires minor
>>>>>>>>>>>>> changes to the current spec and is mostly implementation detail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3
>>>>>>>>>>>>> that only introduce changes to the spec that are not specific to
>>>>>>>>>>>>> materialized views. The idea is to introduce boolean properties
>>>>>>>>>>>>> to be set on the creation of the view and the storage table that
>>>>>>>>>>>>> indicate that they belong to a materialized view. The view
>>>>>>>>>>>>> property "materialized" is set to "true" for a MV and "false"
>>>>>>>>>>>>> for a regular view. And the table property "storage_table" is
>>>>>>>>>>>>> set to "true" for a storage table and "false" for a regular
>>>>>>>>>>>>> table. The absence of these properties indicates a regular view
>>>>>>>>>>>>> or table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>
>>>>>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>>>>>> called "AssertProperty" which could make sure to only perform
>>>>>>>>>>>>> updates that are in line with materialized views.
>>>>>>>>>>>>> The additional requirement can be seen as a general extension
>>>>>>>>>>>>> which does not need to be changed if we decide to go with
>>>>>>>>>>>>> Option 1 in the future.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>>>>> metadata definitions and minimizing spec changes are very
>>>>>>>>>>>>> important. This also minimizes spec drift (between materialized
>>>>>>>>>>>>> views and views spec, and between materialized views and tables
>>>>>>>>>>>>> spec), and simplifies the implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In an effort to take the discussion forward with concrete design
>>>>>>>>>>>>> options based on an end-to-end implementation, I have prototyped
>>>>>>>>>>>>> the implementation (and added Spark support) in this PR
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us
>>>>>>>>>>>>> reach convergence faster. More details about some of the design
>>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>>> combined through a commit process. For instance, keeping a
>>>>>>>>>>>>>> pointer to a table metadata file in a view metadata file or
>>>>>>>>>>>>>> combining commits to reference both. I don't see the value in
>>>>>>>>>>>>>> either option.
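Jan's proposal quoted above (boolean "materialized" and "storage_table" properties plus an "AssertProperty" commit requirement) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions: `AssertPropertySketch`, `AssertProperty`, `flag`, and `validate` are hypothetical names that do not exist in the Iceberg library, and the properties are modeled as a plain string map rather than real view or table metadata.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types exist in the Iceberg library.
final class AssertPropertySketch {
    static final String VIEW_FLAG = "materialized";   // view property from the proposal
    static final String TABLE_FLAG = "storage_table"; // table property from the proposal

    // Absence of the flag means a regular view or table, as in the proposal.
    public static boolean flag(Map<String, String> props, String key) {
        return Boolean.parseBoolean(props.getOrDefault(key, "false"));
    }

    /** A commit requirement that a boolean property has the expected value. */
    static final class AssertProperty {
        private final String key;
        private final boolean expected;

        public AssertProperty(String key, boolean expected) {
            this.key = key;
            this.expected = expected;
        }

        // Throws if the entity's properties do not satisfy the requirement.
        public void validate(Map<String, String> props) {
            if (flag(props, key) != expected) {
                throw new IllegalStateException(
                    "Requirement failed: property '" + key + "' must be " + expected);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> storageTableProps = new HashMap<String, String>();
        storageTableProps.put(TABLE_FLAG, "true");
        new AssertProperty(TABLE_FLAG, true).validate(storageTableProps);
        System.out.println("storage table commit allowed");

        try {
            // A regular table has no flag, so an MV-only update is rejected.
            new AssertProperty(TABLE_FLAG, true).validate(new HashMap<String, String>());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the requirement only inspects a generic property, it is not specific to materialized views, which seems to be the point of keeping it reusable if the design later moves to a dedicated MV object.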
>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question! >>>>>>>>>>>>>>> Just a clarification question regarding your reply before I >>>>>>>>>>>>>>> reply further: >>>>>>>>>>>>>>> what exactly does the option "a combination of the two (i.e. >>>>>>>>>>>>>>> commits are >>>>>>>>>>>>>>> combined)" mean? How is that different from "a new metadata >>>>>>>>>>>>>>> type"? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -Jack >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can >>>>>>>>>>>>>>>> bring a fresh perspective. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Jack already pointed out that we need to start from the >>>>>>>>>>>>>>>> basics and I agree with that. Let’s remove voting at this >>>>>>>>>>>>>>>> point. Right now >>>>>>>>>>>>>>>> is the time for discussing trade-offs, not lining up and >>>>>>>>>>>>>>>> taking sides. I >>>>>>>>>>>>>>>> realize that wasn’t the intent with adding a vote, but that’s >>>>>>>>>>>>>>>> almost always >>>>>>>>>>>>>>>> the result. It’s too easy to use it as a stand-in for >>>>>>>>>>>>>>>> consensus and move on >>>>>>>>>>>>>>>> prematurely. I get the impression from the swirl in Slack that >>>>>>>>>>>>>>>> discussion >>>>>>>>>>>>>>>> has moved ahead of agreement. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We’re still at the most basic question: is a materialized >>>>>>>>>>>>>>>> view a view and a separate table, a combination of the two >>>>>>>>>>>>>>>> (i.e. commits >>>>>>>>>>>>>>>> are combined), or a new metadata type? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some >>>>>>>>>>>>>>>> kind of “system table” (meaning hidden?) 
or if it is exposed >>>>>>>>>>>>>>>> in the >>>>>>>>>>>>>>>> catalog. That’s a later choice (already pointed out) and, I >>>>>>>>>>>>>>>> suspect, it >>>>>>>>>>>>>>>> should be delegated to catalog implementations. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> To simplify this a little, I think that we can eliminate >>>>>>>>>>>>>>>> the option to combine table and view commits. I don’t think >>>>>>>>>>>>>>>> there is a >>>>>>>>>>>>>>>> reason to combine the two. If separate, a table would track >>>>>>>>>>>>>>>> the view >>>>>>>>>>>>>>>> version used along with freshness information for referenced >>>>>>>>>>>>>>>> tables. If the >>>>>>>>>>>>>>>> table is automatically skipped when the version no longer >>>>>>>>>>>>>>>> matches the view, >>>>>>>>>>>>>>>> then no action needs to happen when a view definition changes. >>>>>>>>>>>>>>>> Similarly, >>>>>>>>>>>>>>>> the table can be updated independentl >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>