For me the calendar link did not work on mobile, but I was able to add the dev Google calendar from https://iceberg.apache.org/community/#iceberg-community-events by accessing it from a laptop.
Regards,
Himadri Pal

On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Thanks Jack! I think the images are stripped from the message, but they
> are there on the doc
> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
> if someone wants to check them out (I have left some comments while there).
>
> Also I no longer see the community sync calendar
> https://iceberg.apache.org/community/#slack, so it is unclear when the
> meeting is (and we do not have the link).
>
> Thanks,
> Walaa.
>
>
> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Thanks Jan! +1 for everyone to take a look before the discussion, and see
>> if there are any missing options or major arguments.
>>
>> I have also added the images regarding all the options, it might be
>> easier to parse than the big sheet. I will also put it here for people that
>> do not have time to read through it:
>>
>> *Option 1: Add storage table identifier in view metadata content*
>> [image: MV option 1.png]
>>
>> *Option 2: Add storage table metadata file pointer in view object*
>> [image: MV option 2.png]
>>
>> *Option 3: Add storage table metadata file pointer in view metadata content*
>> [image: MV option 3.png]
>>
>> *Option 4: Embed table metadata in view metadata content*
>> [image: MV option 4.png]
>>
>> *Option 5: New MV spec, MV object has table and view metadata file pointers*
>> [image: MV option 5.png]
>>
>> *Option 6: New MV spec, MV metadata content embeds table and view metadata*
>> [image: MV option 6.png]
>>
>> *Option 7: New MV spec, completely new MV metadata content*
>> [image: MV option 7.png]
>>
>> -Jack
>>
>>
>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>>
>>> I think it's great to have a face to face discussion about this.
>>> Additionally, I would propose to use Jack's document
>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>> as a common ground for the discussion and that everyone has a quick look
>>> before the next community sync. If you think the document is still missing
>>> some arguments, please make suggestions to add them. This way we spend
>>> less time getting everyone up to speed and share a common terminology.
>>>
>>> Looking forward to the discussion, best wishes
>>>
>>> Jan
>>>
>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>
>>> The calendar on the site is currently broken
>>> https://iceberg.apache.org/community/#iceberg-community-events. Might
>>> help to fix it or share the meeting link here.
>>>
>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Sounds good, let's discuss this in person!
>>>>
>>>> I am a bit worried that we have quite a few critical topics going on
>>>> right now on devlist, and this will take up a lot of time to discuss. If
>>>> it ends up going for too long, I propose we have a dedicated meeting, and
>>>> I am more than happy to organize it.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I think this thread has hit a point of diminishing returns and that we
>>>>> still don't have a common understanding of what the options under
>>>>> consideration actually are.
>>>>>
>>>>> Since we were already planning on discussing this at the next
>>>>> community sync, I suggest we pick this up there and use that time to align
>>>>> on what exactly we're considering. We can then start a new thread to lay
>>>>> out the designs under consideration in more detail and then have a
>>>>> discussion about trade-offs.
>>>>>
>>>>> Does that sound reasonable?
>>>>>
>>>>> Ryan
>>>>>
>>>>>
>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>> also suggest breaking the expectation/outcome to milestones. Maybe it
>>>>>> becomes easier if we agree to distinguish between an approach that is
>>>>>> feasible in the near term and another in the long term, especially if the
>>>>>> latter requires significant engine-side changes.
>>>>>>
>>>>>> Further, maybe it helps if we start with an option that fully reuses
>>>>>> the existing spec, and see how we view it in comparison with the options
>>>>>> discussed previously. I am sharing one below. It reuses the current spec
>>>>>> of Iceberg views and tables by leveraging table properties to capture
>>>>>> materialized view metadata. What is common (and not common) between this
>>>>>> and the desired representations?
>>>>>>
>>>>>> The new properties are:
>>>>>>
>>>>>> Properties on a View:
>>>>>>
>>>>>> 1. *iceberg.materialized.view*:
>>>>>>    - *Type*: View property
>>>>>>    - *Purpose*: This property is used to mark whether a view is a
>>>>>>      materialized view. If set to true, the view is treated as a
>>>>>>      materialized view. This helps in differentiating between virtual
>>>>>>      and materialized views within the catalog and dictates specific
>>>>>>      handling and validation logic for materialized views.
>>>>>> 2. *iceberg.materialized.view.storage.location*:
>>>>>>    - *Type*: View property
>>>>>>    - *Purpose*: Specifies the location of the storage table
>>>>>>      associated with the materialized view. This property is used for
>>>>>>      linking a materialized view with its corresponding storage table,
>>>>>>      enabling data management and query execution based on the stored
>>>>>>      data freshness.
>>>>>>
>>>>>> Properties on a Table:
>>>>>>
>>>>>> 1. *base.snapshot.[UUID]*:
>>>>>>    - *Type*: Table property
>>>>>>    - *Purpose*: These properties store the snapshot IDs of the
>>>>>>      base tables at the time the materialized view's data was last
>>>>>>      updated. Each property is prefixed with base.snapshot. followed by
>>>>>>      the UUID of the base table. They are used to track whether the
>>>>>>      materialized view's data is up to date with the base tables by
>>>>>>      comparing these snapshot IDs with the current snapshot IDs of the
>>>>>>      base tables. If all the base tables' current snapshot IDs match the
>>>>>>      ones stored in these properties, the materialized view's data is
>>>>>>      considered fresh.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> > All of these approaches are aligned in one, specific way: the
>>>>>>> > storage table is an iceberg table.
>>>>>>>
>>>>>>> I do not think that is true. I think people are aligned that we
>>>>>>> would like to re-use the Iceberg table metadata defined in the Iceberg
>>>>>>> table spec to express the data in MV, but I don't think it goes that
>>>>>>> far to say it must be an Iceberg table. Once you have that mindset, then
>>>>>>> of course option 1 (separate table and view) is the only option.
>>>>>>>
>>>>>>> > I don't think that is necessary and it significantly increases the
>>>>>>> > complexity.
>>>>>>>
>>>>>>> And can you quantify what you mean by "significantly increases the
>>>>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff
>>>>>>> with complexity. We probably all agree that using option 7 (a completely
>>>>>>> new metadata type) is a lot of work from scratch, that is why it is not
>>>>>>> favored. However, my understanding is that as long as we re-use the view
>>>>>>> and table metadata, then the majority of the existing logic can be
>>>>>>> reused.
>>>>>>> I think what we have gone through in Slack to draft the rough Java API >>>>>>> shape helps here, because people can estimate the amount of effort >>>>>>> required >>>>>>> to implement it. And I don't think they are **significantly** more >>>>>>> complex >>>>>>> to implement. Could you elaborate more about the complexity that you >>>>>>> imagine? >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks < >>>>>>> daniel.c.we...@gmail.com> wrote: >>>>>>> >>>>>>>> I feel I've been most vocal about pushing back against options 2+ >>>>>>>> (or Ryan's categories of combined table/view, or new metadata type), so >>>>>>>> I'll try to expand on my reasoning. >>>>>>>> >>>>>>>> I understand the appeal of creating a design where we encapsulate >>>>>>>> the view/storage from both a structural and performance standpoint, >>>>>>>> but I >>>>>>>> don't think that is necessary and it significantly increases the >>>>>>>> complexity. >>>>>>>> >>>>>>>> All of these approaches are aligned in one, specific way: the >>>>>>>> storage table is an iceberg table. >>>>>>>> >>>>>>>> Because of this, all the behaviors and requirements still apply to >>>>>>>> these tables. They need to be maintained (snapshot cleanup, orphan >>>>>>>> files), >>>>>>>> in cases need to be optimized (compaction, manifest rewrites), they >>>>>>>> need to >>>>>>>> be able to be inspected (this will be even more important with MV since >>>>>>>> staleness can produce different results and questions will arise about >>>>>>>> what >>>>>>>> state the storage table was in). There may be cases where the tables >>>>>>>> need >>>>>>>> to be managed directly. >>>>>>>> >>>>>>>> Anywhere we deviate from the existing constructs/commit/access for >>>>>>>> tables, we will ultimately have to then unwrap to re-expose the >>>>>>>> underlying >>>>>>>> Iceberg behavior. 
This creates unnecessary complexity in the >>>>>>>> library/API >>>>>>>> layer, which are not the primary interface users will have with >>>>>>>> materialized views where an engine is almost entirely necessary to >>>>>>>> interact >>>>>>>> with the dataset. >>>>>>>> >>>>>>>> As to the performance concerns around option 1, I think we're >>>>>>>> overstating the downsides. It really comes down to how many metadata >>>>>>>> loads >>>>>>>> are necessary and evaluating freshness would likely be the real >>>>>>>> bottleneck >>>>>>>> as it involves potentially loading many tables. All of the options >>>>>>>> are on >>>>>>>> the same order of performance for the metadata and table loads. >>>>>>>> >>>>>>>> As to the visibility of tables and whether they're registered in >>>>>>>> the catalog, I think registering in the catalog is the right approach >>>>>>>> so >>>>>>>> that the tables are still addressable for maintenance/etc. The >>>>>>>> visibility >>>>>>>> of the storage table is a catalog implementation decision and >>>>>>>> shouldn't be >>>>>>>> a requirement of the MV spec (I can see cases for both and it isn't >>>>>>>> necessary to dictate a behavior). >>>>>>>> >>>>>>>> I'm still strongly in favor of Option 1 (separate table and view) >>>>>>>> for these reasons. >>>>>>>> >>>>>>>> -Dan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> > Jack, it sounds like you’re the proponent of a combined table >>>>>>>>> and view (rather than a new metadata spec for a materialized view). >>>>>>>>> What is >>>>>>>>> the main motivation? It seems like you’re convinced of that approach, >>>>>>>>> but I >>>>>>>>> don’t understand the advantage it brings. >>>>>>>>> >>>>>>>>> Sorry I have to make a Google Sheet to capture all the options we >>>>>>>>> have discussed so far, I wanted to use the existing Google Doc, but >>>>>>>>> it has >>>>>>>>> really bad table/sheet support... 
>>>>>>>>> >>>>>>>>> >>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0 >>>>>>>>> >>>>>>>>> I have listed all the options, with how they are implemented and >>>>>>>>> some important considerations we have discussed so far. Note that: >>>>>>>>> 1. This sheet currently excludes the lineage information, which we >>>>>>>>> can discuss more later after the current topic is resolved. >>>>>>>>> 2. I removed the considerations for REST integration since from >>>>>>>>> the other thread we have clarified that they should be considered >>>>>>>>> completely separately. >>>>>>>>> >>>>>>>>> *Why I come as a proponent of having a new MV object with table >>>>>>>>> and view metadata file pointer* >>>>>>>>> >>>>>>>>> In my sheet, there are 3 options that do not have major problems: >>>>>>>>> Option 2: Add storage table metadata file pointer in view object >>>>>>>>> Option 5: New MV object with table and view metadata file pointer >>>>>>>>> Option 6: New MV spec with table and view metadata >>>>>>>>> >>>>>>>>> I originally excluded option 2 because I think it does not align >>>>>>>>> with the REST spec, but after the other discussion thread about >>>>>>>>> "Inconsistency >>>>>>>>> between REST spec and table/view spec", I think my original concern no >>>>>>>>> longer holds true so now I put it back. And based on my personal >>>>>>>>> preference that MV is an independent object that should be separated >>>>>>>>> from >>>>>>>>> view and table, plus the fact that option 5 is probably less work than >>>>>>>>> option 6 for implementation, that is how I come as a proponent of >>>>>>>>> option 5 >>>>>>>>> at this moment. >>>>>>>>> >>>>>>>>> >>>>>>>>> *Regarding Ryan's evaluation framework * >>>>>>>>> >>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation >>>>>>>>> framework. 
That framework categorization puts option 2, 3, 4, 5, 6 all >>>>>>>>> under the same category of "A combination of a view and a table" >>>>>>>>> and concludes that they don't have any advantage for the same set of >>>>>>>>> reasons. But those reasons are not really convincing to me so let's >>>>>>>>> talk >>>>>>>>> about them in more detail. >>>>>>>>> >>>>>>>>> (1) You said "I don’t see a reason why a combined view and table >>>>>>>>> is advantageous" as "this would cause unnecessary dependence between >>>>>>>>> the >>>>>>>>> view and table in catalogs." What dependency exactly do you mean >>>>>>>>> here? And >>>>>>>>> why is that unnecessary, given there has to be some sort of dependency >>>>>>>>> anyway unless we go with option 5 or 6? >>>>>>>>> >>>>>>>>> (2) You said "I guess there’s an argument that you could load both >>>>>>>>> table and view metadata locations at the same time. That hardly seems >>>>>>>>> worth >>>>>>>>> the trouble". I disagree with that. Catalog interaction performance is >>>>>>>>> critical to at least everyone working in EMR and Athena, and MV >>>>>>>>> itself as >>>>>>>>> an acceleration approach needs to be as fast as possible. >>>>>>>>> >>>>>>>>> I have put 3 key operations in the doc that I think matters for MV >>>>>>>>> during interactions with engine: >>>>>>>>> 1. refreshes storage table >>>>>>>>> 2. get the storage table of the MV >>>>>>>>> 3. if stale, get the view SQL >>>>>>>>> >>>>>>>>> And option 1 clearly falls short with 4 sequential steps required >>>>>>>>> to load a storage table. You mentioned "recent issues with adding >>>>>>>>> views to >>>>>>>>> the JDBC catalog" in this topic, could you explain a bit more? >>>>>>>>> >>>>>>>>> (3) You said "I also think that once we decide on structure, we >>>>>>>>> can make it possible for REST catalog implementations to do smart >>>>>>>>> things, >>>>>>>>> in a way that doesn’t put additional requirements on the underlying >>>>>>>>> catalog >>>>>>>>> store." 
If REST is fully compatible with Iceberg spec then I have no >>>>>>>>> problem with this statement. However, as we discussed in the other >>>>>>>>> thread, >>>>>>>>> it is not the case. In the current state, I think the sequence of >>>>>>>>> action >>>>>>>>> should be to evolve the Iceberg table/view spec (or add a MV spec) >>>>>>>>> first, >>>>>>>>> and then think about how REST can incorporate it or do smart things >>>>>>>>> that >>>>>>>>> are not Iceberg spec compliant. Do you agree with that? >>>>>>>>> >>>>>>>>> (4) You said the table identifier pointer "is a problem we need to >>>>>>>>> solve generally because a materialized table needs to be able to >>>>>>>>> track the >>>>>>>>> upstream state of tables that were used". I don't think that is a >>>>>>>>> reason to >>>>>>>>> choose to use a table identifier pointer for a storage table. The >>>>>>>>> issue is >>>>>>>>> not about using a table identifier pointer. It is about exposing the >>>>>>>>> storage table as a separate entity in the catalog, which is what >>>>>>>>> people do >>>>>>>>> not like and is already discussed in length in Jan's question 3 (also >>>>>>>>> linked in the sheet). I agree with that statement, because without a >>>>>>>>> REST >>>>>>>>> implementation that can magically hide the storage table, this model >>>>>>>>> adds >>>>>>>>> additional burden regarding compliance and data governance for any >>>>>>>>> other >>>>>>>>> non-REST catalog implementations that are compliant to the Iceberg >>>>>>>>> spec. >>>>>>>>> Many mechanisms need to be built in a catalog to hide, protect, >>>>>>>>> maintain, >>>>>>>>> recycle the storage table, that can be avoided by using other >>>>>>>>> approaches. I >>>>>>>>> think we should reach a consensus about that and discuss further if >>>>>>>>> you do >>>>>>>>> not agree. 
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>>>>>>>>>> Your categories correspond to the following designs:
>>>>>>>>>>
>>>>>>>>>> - Separate table and view => Design 1
>>>>>>>>>> - Combination of view and table => Design 2
>>>>>>>>>> - A new metadata type => Design 4
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>>
>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>
>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so
>>>>>>>>>> I’ll be more specific:
>>>>>>>>>>
>>>>>>>>>> - *Separate table and view*: this option is to have the
>>>>>>>>>>   objects that we have today, with extra metadata. Commit processes
>>>>>>>>>>   are separate: committing to the table doesn’t alter the view and
>>>>>>>>>>   committing to the view doesn’t change the table. However, changing
>>>>>>>>>>   the view can make it so the table is no longer useful as a
>>>>>>>>>>   materialization.
>>>>>>>>>> - *A combination of a view and a table*: in this option, the
>>>>>>>>>>   table metadata and view metadata are the same as the first
>>>>>>>>>>   option. The difference is that the commit process combines them,
>>>>>>>>>>   either by embedding a table metadata location in view metadata or
>>>>>>>>>>   by tracking both in the same catalog reference.
>>>>>>>>>> - *A new metadata type*: this option is where we define a new
>>>>>>>>>>   metadata object that has view attributes, like SQL
>>>>>>>>>>   representations, along with table attributes, like partition specs
>>>>>>>>>>   and snapshots.
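
The three categories listed above differ mainly in how many metadata objects exist and how many independent commit paths touch them. The following is only a rough sketch of that distinction, not part of any proposal in this thread: the class `MvCategories`, the enum constants, and both fields are hypothetical names chosen for illustration, and nothing here is Iceberg API.

```java
// Hypothetical illustration only: none of these names exist in Iceberg.
class MvCategories {

    enum Category {
        // Table and view keep their own metadata and commit separately.
        SEPARATE_TABLE_AND_VIEW(2, 2),
        // Same two metadata objects, but one commit process covers both
        // (table location embedded in view metadata, or both locations
        // tracked in a single catalog reference).
        COMBINED_VIEW_AND_TABLE(2, 1),
        // One new metadata object mixing view and table attributes.
        NEW_METADATA_TYPE(1, 1);

        final int metadataObjects; // distinct metadata objects involved
        final int commitPaths;     // independent commit processes

        Category(int metadataObjects, int commitPaths) {
            this.metadataObjects = metadataObjects;
            this.commitPaths = commitPaths;
        }
    }

    public static void main(String[] args) {
        for (Category c : Category.values()) {
            System.out.println(c + ": metadataObjects=" + c.metadataObjects
                + ", commitPaths=" + c.commitPaths);
        }
    }
}
```

Read this way, the first two categories share the same metadata and differ only in the commit process, while the third changes the metadata itself.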
>>>>>>>>>> >>>>>>>>>> Hopefully this is clear because I think much of the confusion is >>>>>>>>>> caused by different definitions. >>>>>>>>>> >>>>>>>>>> The LoadTableResponse having optional metadata-location field >>>>>>>>>> implies that the object in the catalog no longer needs to hold a >>>>>>>>>> metadata >>>>>>>>>> file pointer >>>>>>>>>> >>>>>>>>>> The REST protocol has not removed the requirement for a metadata >>>>>>>>>> file, so I’m going to keep focused on the MV design options. >>>>>>>>>> >>>>>>>>>> When we say a MV can be a “new metadata type”, it does not mean >>>>>>>>>> it needs to define a completely brand new structure of the metadata >>>>>>>>>> content >>>>>>>>>> >>>>>>>>>> I’m making a distinction between separate metadata files for the >>>>>>>>>> table and the view and a combined metadata object, as above. >>>>>>>>>> >>>>>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which >>>>>>>>>> has 1 table metadata file pointer, and 1 view metadata file pointer >>>>>>>>>> >>>>>>>>>> This is the option I am referring to as a “combination of a view >>>>>>>>>> and a table”. >>>>>>>>>> >>>>>>>>>> So to review my initial email, I don’t see a reason why a >>>>>>>>>> combined view and table is advantageous, either implemented by >>>>>>>>>> having a >>>>>>>>>> catalog reference with two metadata locations or embedding a table >>>>>>>>>> metadata >>>>>>>>>> location in view metadata. This would cause unnecessary dependence >>>>>>>>>> between >>>>>>>>>> the view and table in catalogs. I guess there’s an argument that you >>>>>>>>>> could >>>>>>>>>> load both table and view metadata locations at the same time. That >>>>>>>>>> hardly >>>>>>>>>> seems worth the trouble given the recent issues with adding views to >>>>>>>>>> the >>>>>>>>>> JDBC catalog. 
>>>>>>>>>> >>>>>>>>>> I also think that once we decide on structure, we can make it >>>>>>>>>> possible for REST catalog implementations to do smart things, in a >>>>>>>>>> way that >>>>>>>>>> doesn’t put additional requirements on the underlying catalog store. >>>>>>>>>> For >>>>>>>>>> instance, we could specify how to send additional objects in a >>>>>>>>>> LoadViewResult, in case the catalog wants to pre-fetch table >>>>>>>>>> metadata. I >>>>>>>>>> think these optimizations are a later addition, after we define the >>>>>>>>>> relationship between views and tables. >>>>>>>>>> >>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table and >>>>>>>>>> view (rather than a new metadata spec for a materialized view). What >>>>>>>>>> is the >>>>>>>>>> main motivation? It seems like you’re convinced of that approach, >>>>>>>>>> but I >>>>>>>>>> don’t understand the advantage it brings. >>>>>>>>>> >>>>>>>>>> Ryan >>>>>>>>>> >>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho < >>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi >>>>>>>>>>> >>>>>>>>>>> Yes I mostly agree with the assessment. To clarify a few minor >>>>>>>>>>> points. >>>>>>>>>>> >>>>>>>>>>> is a materialized view a view and a separate table, a >>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new >>>>>>>>>>>> metadata type? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial >>>>>>>>>>> proposal of a new Catalog MV object that has two references >>>>>>>>>>> (ViewMetadata + >>>>>>>>>>> TableMetadata). >>>>>>>>>>> >>>>>>>>>>> The arguments that I see for a combined materialized view object >>>>>>>>>>>> are: >>>>>>>>>>>> >>>>>>>>>>>> - Regular views are separate, rather than being tables with >>>>>>>>>>>> SQL and no data so it would be inconsistent (“Iceberg view is >>>>>>>>>>>> just a table >>>>>>>>>>>> with no data but with representations defined. 
But we did not >>>>>>>>>>>> do that.”) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> - Materialized views are different objects in DDL >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> - Tables may be a superset of functionality needed for >>>>>>>>>>>> materialized views >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> - Tables are not typically exposed to end users — but this >>>>>>>>>>>> isn’t required by the separate view and table option >>>>>>>>>>>> >>>>>>>>>>>> For completeness, there seem to be a few additional ones >>>>>>>>>>> (mentioned in the Slack and above messages). >>>>>>>>>>> >>>>>>>>>>> - Lack of spec change (to ViewMetadata). But as Jack says >>>>>>>>>>> it is a spec change (ie, to catalogs) >>>>>>>>>>> - A single call to get the View's StorageTable (versus two >>>>>>>>>>> calls) >>>>>>>>>>> - A more natural API, no opportunity for user to call >>>>>>>>>>> Catalog.dropTable() and renameTable() on storage table >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> *Thoughts: *I think the long discussion sessions we had on >>>>>>>>>>> Slack was fruitful for me, as seeing the API clarified some things. >>>>>>>>>>> >>>>>>>>>>> I was initially more in favor of MV being a new metadata type >>>>>>>>>>> (TableMetadata + ViewMetadata). But seeing most of the MV >>>>>>>>>>> operations end >>>>>>>>>>> up being ViewCatalog or Catalog operations, I am starting to think >>>>>>>>>>> API-wise >>>>>>>>>>> that it may not align with the new metadata type (unless we define >>>>>>>>>>> MVCatalog and /MV REST endpoints, which then are boilerplate >>>>>>>>>>> wrappers). >>>>>>>>>>> >>>>>>>>>>> Initially one question I had for option 'a view and a separate >>>>>>>>>>> table', was how to make this table reference (metadata.json or >>>>>>>>>>> catalog >>>>>>>>>>> reference). In the previous option, we had a precedent of Catalog >>>>>>>>>>> references to Metadata, but not pointers between Metadatas. 
I >>>>>>>>>>> initially >>>>>>>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting' >>>>>>>>>>> catalog >>>>>>>>>>> concerns in ViewMetadata. (I saw Catalog and ViewCatalog as a >>>>>>>>>>> layer above >>>>>>>>>>> TableMetadata and ViewMetadata). But I think Dan in the Slack made >>>>>>>>>>> a fair >>>>>>>>>>> point that ViewMetadata already is tightly bound with a Catalog. >>>>>>>>>>> In this >>>>>>>>>>> case, I think this approach does have its merits as well in aligning >>>>>>>>>>> Catalog API's with the metadata. >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Szehon >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul >>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi all, >>>>>>>>>>>> >>>>>>>>>>>> I would like to provide my perspective on the question of what >>>>>>>>>>>> a materialized view is and elaborate on Jack's recent proposal to >>>>>>>>>>>> view a >>>>>>>>>>>> materialized view as a catalog concept. >>>>>>>>>>>> >>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in >>>>>>>>>>>> the catalog has a *unique identifier*, and the catalog >>>>>>>>>>>> provides methods to create, load, and update these entities. An >>>>>>>>>>>> important >>>>>>>>>>>> thing to note is that the catalog methods exhibit two different >>>>>>>>>>>> behaviors: >>>>>>>>>>>> the *create and load methods deal with the entire entity*, >>>>>>>>>>>> while the *update(commit) method only deals with partial >>>>>>>>>>>> changes* to the entities. >>>>>>>>>>>> >>>>>>>>>>>> In the context of our current discussion, materialized view >>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact that >>>>>>>>>>>> the >>>>>>>>>>>> update method deals only with partial changes, enables us to *reuse >>>>>>>>>>>> the existing methods for updating tables and views*. 
For >>>>>>>>>>>> updates we don't have to define what constitutes an entire >>>>>>>>>>>> materialized >>>>>>>>>>>> view. Changes to a materialized view targeting the properties >>>>>>>>>>>> related to >>>>>>>>>>>> the view metadata could use the update(commit) view method. >>>>>>>>>>>> Similarly, >>>>>>>>>>>> changes targeting the properties related to the table metadata >>>>>>>>>>>> could use >>>>>>>>>>>> the update(commit) table method. This is great news because we >>>>>>>>>>>> don't have >>>>>>>>>>>> to redefine view and table commits (requirements, updates). >>>>>>>>>>>> This is shown in the fact that Jack uses the same operation to >>>>>>>>>>>> update the storage table for Option 1 and 3: >>>>>>>>>>>> >>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true >>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location >>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit(); >>>>>>>>>>>> >>>>>>>>>>>> The open question is *whether the create and load methods >>>>>>>>>>>> should treat the properties that constitute the MV metadata as two >>>>>>>>>>>> entities >>>>>>>>>>>> (View + Table) or one entity (new MV object)*. This is all >>>>>>>>>>>> part of Jack's proposal, where Option 1 proposes a new MV object, >>>>>>>>>>>> and >>>>>>>>>>>> Option 3 proposes two separate entities. The advantage of Option 1 >>>>>>>>>>>> is that >>>>>>>>>>>> it doesn't require two operations to load the metadata. On the >>>>>>>>>>>> other hand, >>>>>>>>>>>> the advantage of Option 3 is that no new operations or catalogs >>>>>>>>>>>> have to be >>>>>>>>>>>> defined. >>>>>>>>>>>> >>>>>>>>>>>> In my opinion, defining a new representation for materialized >>>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see >>>>>>>>>>>> a path >>>>>>>>>>>> where we could first introduce Option 3 and still have the >>>>>>>>>>>> possibility to >>>>>>>>>>>> transition to Option 1 if needed. 
>>>>>>>>>>>> The great thing about Option 3 is that it only requires minor
>>>>>>>>>>>> changes to the current spec and is mostly implementation detail.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3
>>>>>>>>>>>> that only introduce changes to the spec that are not specific to
>>>>>>>>>>>> materialized views. The idea is to introduce boolean properties to
>>>>>>>>>>>> be set on the creation of the view and the storage table that
>>>>>>>>>>>> indicate that they belong to a materialized view. The view property
>>>>>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>>>>>> view. And the table property "storage_table" is set to "true" for a
>>>>>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>>>>>> properties indicates a regular view or table.
>>>>>>>>>>>>
>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>
>>>>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>>>>> called "AssertProperty" which could make sure to only perform
>>>>>>>>>>>> updates that are in line with materialized views.
>>>>>>>>>>>> The additional requirement can be seen as a general extension
>>>>>>>>>>>> which does not need to be changed if we decide to go with Option 1
>>>>>>>>>>>> in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>
>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>
>>>>>>>>>>>> Jan
>>>>>>>>>>>>
>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>>>> metadata definitions and minimizing spec changes are very
>>>>>>>>>>>> important. This also minimizes spec drift (between materialized
>>>>>>>>>>>> views and views spec, and between materialized views and tables
>>>>>>>>>>>> spec), and simplifies the implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> In an effort to take the discussion forward with concrete
>>>>>>>>>>>> design options based on an end-to-end implementation, I have
>>>>>>>>>>>> prototyped the implementation (and added Spark support) in this PR
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps
>>>>>>>>>>>> us reach convergence faster. More details about some of the design
>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>> combined through a commit process. For instance, keeping a
>>>>>>>>>>>>> pointer to a table metadata file in a view metadata file or
>>>>>>>>>>>>> combining commits to reference both. I don't see the value in
>>>>>>>>>>>>> either option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question!
>>>>>>>>>>>>>> Just a clarification question regarding your reply before I >>>>>>>>>>>>>> reply further: >>>>>>>>>>>>>> what exactly does the option "a combination of the two (i.e. >>>>>>>>>>>>>> commits are >>>>>>>>>>>>>> combined)" mean? How is that different from "a new metadata >>>>>>>>>>>>>> type"? >>>>>>>>>>>>>> >>>>>>>>>>>>>> -Jack >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can >>>>>>>>>>>>>>> bring a fresh perspective. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jack already pointed out that we need to start from the >>>>>>>>>>>>>>> basics and I agree with that. Let’s remove voting at this >>>>>>>>>>>>>>> point. Right now >>>>>>>>>>>>>>> is the time for discussing trade-offs, not lining up and taking >>>>>>>>>>>>>>> sides. I >>>>>>>>>>>>>>> realize that wasn’t the intent with adding a vote, but that’s >>>>>>>>>>>>>>> almost always >>>>>>>>>>>>>>> the result. It’s too easy to use it as a stand-in for consensus >>>>>>>>>>>>>>> and move on >>>>>>>>>>>>>>> prematurely. I get the impression from the swirl in Slack that >>>>>>>>>>>>>>> discussion >>>>>>>>>>>>>>> has moved ahead of agreement. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We’re still at the most basic question: is a materialized >>>>>>>>>>>>>>> view a view and a separate table, a combination of the two >>>>>>>>>>>>>>> (i.e. commits >>>>>>>>>>>>>>> are combined), or a new metadata type? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some >>>>>>>>>>>>>>> kind of “system table” (meaning hidden?) or if it is exposed in >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> catalog. That’s a later choice (already pointed out) and, I >>>>>>>>>>>>>>> suspect, it >>>>>>>>>>>>>>> should be delegated to catalog implementations. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> To simplify this a little, I think that we can eliminate the >>>>>>>>>>>>>>> option to combine table and view commits. I don’t think there >>>>>>>>>>>>>>> is a reason >>>>>>>>>>>>>>> to combine the two. If separate, a table would track the view >>>>>>>>>>>>>>> version used >>>>>>>>>>>>>>> along with freshness information for referenced tables. If the >>>>>>>>>>>>>>> table is >>>>>>>>>>>>>>> automatically skipped when the version no longer matches the >>>>>>>>>>>>>>> view, then no >>>>>>>>>>>>>>> action needs to happen when a view definition changes. >>>>>>>>>>>>>>> Similarly, the table >>>>>>>>>>>>>>> can be updated independentl >>>>>>>>>>>>>>> >>>>>>>>>>>>>>