Re: Materialized view integration with REST spec

himadri pal Tue, 05 Mar 2024 07:27:59 -0800

For me the calendar link did not work on mobile, but I was able to add the
dev Google calendar from
https://iceberg.apache.org/community/#iceberg-community-events by accessing
it from a laptop.


Regards,
Himadri Pal


On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Jack! I think the images are stripped from the message, but they
> are there on the doc
> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
> if someone wants to check them out (I have left some comments while there).
>
> Also I no longer see the community sync calendar
> https://iceberg.apache.org/community/#slack, so it is unclear when the
> meeting is (and we do not have the link).
>
> Thanks,
> Walaa.
>
>
> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Thanks Jan! +1 for everyone to take a look before the discussion, and see
>> if there are any missing options or major arguments.
>>
>> I have also added images for all the options, which might be easier to
>> parse than the big sheet. I will also put them here for people who do not
>> have time to read through it:
>>
>>
>> *Option 1: Add storage table identifier in view metadata content*
>>
>> [image: MV option 1.png]
>>
>> *Option 2: Add storage table metadata file pointer in view object*
>>
>> [image: MV option 2.png]
>>
>> *Option 3: Add storage table metadata file pointer in view metadata
>> content*
>>
>> [image: MV option 3.png]
>>
>> *Option 4: Embed table metadata in view metadata content*
>>
>> [image: MV option 4.png]
>>
>> *Option 5: New MV spec, MV object has table and view metadata file
>> pointers*
>>
>> [image: MV option 5.png]
>>
>> *Option 6: New MV spec, MV metadata content embeds table and view
>> metadata*
>>
>> [image: MV option 6.png]
>>
>> *Option 7: New MV spec, completely new MV metadata content*
>>
>> [image: MV option 7.png]
>>
>> -Jack
>>
>>
>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>>
>>> I think it's great to have a face-to-face discussion about this.
>>> Additionally, I would propose using Jack's document
>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>> as common ground for the discussion, and that everyone has a quick look
>>> before the next community sync. If you think the document is still missing
>>> some arguments, please make suggestions to add them. That way we spend
>>> less time getting everyone up to speed and share a common terminology.
>>>
>>> Looking forward to the discussion, best wishes
>>>
>>> Jan
>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>
>>> The calendar on the site is currently broken
>>> https://iceberg.apache.org/community/#iceberg-community-events. Might
>>> help to fix it or share the meeting link here.
>>>
>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Sounds good, let's discuss this in person!
>>>>
>>>> I am a bit worried that we have quite a few critical topics going on
>>>> right now on the dev list, and this will take up a lot of time to discuss.
>>>> If it ends up going on for too long, I propose we hold a dedicated meeting,
>>>> and I am more than happy to organize it.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I think this thread has hit a point of diminishing returns and that we
>>>>> still don't have a common understanding of what the options under
>>>>> consideration actually are.
>>>>>
>>>>> Since we were already planning on discussing this at the next
>>>>> community sync, I suggest we pick this up there and use that time to align
>>>>> on what exactly we're considering. We can then start a new thread to lay
>>>>> out the designs under consideration in more detail and then have a
>>>>> discussion about trade-offs.
>>>>>
>>>>> Does that sound reasonable?
>>>>>
>>>>> Ryan
>>>>>
>>>>>
>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>> also suggest breaking the expectation/outcome into milestones. Maybe it
>>>>>> becomes easier if we agree to distinguish between an approach that is
>>>>>> feasible in the near term and another in the long term, especially if the
>>>>>> latter requires significant engine-side changes.
>>>>>>
>>>>>> Further, maybe it helps if we start with an option that fully reuses
>>>>>> the existing spec, and see how we view it in comparison with the options
>>>>>> discussed previously. I am sharing one below. It reuses the current spec
>>>>>> of Iceberg views and tables by leveraging table properties to capture
>>>>>> materialized view metadata. What is common (and not common) between this
>>>>>> and the desired representations?
>>>>>>
>>>>>> The new properties are:
>>>>>>
>>>>>> Properties on a View:
>>>>>>
>>>>>>    1. *iceberg.materialized.view*:
>>>>>>       - *Type*: View property
>>>>>>       - *Purpose*: This property is used to mark whether a view is a
>>>>>>       materialized view. If set to true, the view is treated as a
>>>>>>       materialized view. This helps in differentiating between virtual
>>>>>>       and materialized views within the catalog and dictates specific
>>>>>>       handling and validation logic for materialized views.
>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>       - *Type*: View property
>>>>>>       - *Purpose*: Specifies the location of the storage table
>>>>>>       associated with the materialized view. This property is used for
>>>>>>       linking a materialized view with its corresponding storage table,
>>>>>>       enabling data management and query execution based on the stored
>>>>>>       data freshness.
>>>>>>
>>>>>> Properties on a Table:
>>>>>>
>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>       - *Type*: Table property
>>>>>>       - *Purpose*: These properties store the snapshot IDs of the
>>>>>>       base tables at the time the materialized view's data was last
>>>>>>       updated. Each property is prefixed with base.snapshot. followed by
>>>>>>       the UUID of the base table. They are used to track whether the
>>>>>>       materialized view's data is up to date with the base tables by
>>>>>>       comparing these snapshot IDs with the current snapshot IDs of the
>>>>>>       base tables. If all the base tables' current snapshot IDs match the
>>>>>>       ones stored in these properties, the materialized view's data is
>>>>>>       considered fresh.
>>>>>>
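The freshness rule described for the base.snapshot.[UUID] properties could
be sketched roughly as follows. This is only an illustration under stated
assumptions: the storage table properties and the base tables' current
snapshot IDs are passed in as plain maps, and the names FreshnessCheck and
isFresh are hypothetical, not part of the Iceberg API. Only the
base.snapshot. prefix comes from the proposal.

```java
import java.util.Map;

public class FreshnessCheck {
  // Prefix taken from the proposal; everything else in this class is illustrative.
  static final String PREFIX = "base.snapshot.";

  /**
   * Returns true when every snapshot ID recorded on the storage table matches
   * the current snapshot ID of the corresponding base table.
   *
   * @param storageTableProps properties of the storage table
   * @param currentSnapshotIds base table UUID to current snapshot ID (assumed input)
   */
  public static boolean isFresh(Map<String, String> storageTableProps,
                                Map<String, Long> currentSnapshotIds) {
    int recorded = 0;
    for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
      if (!entry.getKey().startsWith(PREFIX)) {
        continue; // unrelated table property
      }
      recorded++;
      String baseTableUuid = entry.getKey().substring(PREFIX.length());
      Long current = currentSnapshotIds.get(baseTableUuid);
      if (current == null || current != Long.parseLong(entry.getValue())) {
        return false; // base table is unknown or has moved to a newer snapshot
      }
    }
    // Assumption: with no recorded base snapshots the data is treated as stale.
    return recorded > 0;
  }
}
```

An engine would presumably fall back to the view SQL whenever a check like
this returns false.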
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> > All of these approaches are aligned in one, specific way: the
>>>>>>> storage table is an iceberg table.
>>>>>>>
>>>>>>> I do not think that is true. I think people are aligned that we
>>>>>>> would like to re-use the Iceberg table metadata defined in the Iceberg
>>>>>>> table spec to express the data in MV, but I don't think it goes so far
>>>>>>> as to say it must be an Iceberg table. Once you have that mindset, then
>>>>>>> of course option 1 (separate table and view) is the only option.
>>>>>>>
>>>>>>> > I don't think that is necessary and it significantly increases the
>>>>>>> complexity.
>>>>>>>
>>>>>>> And can you quantify what you mean by "significantly increases the
>>>>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff
>>>>>>> with complexity. We probably all agree that using option 7 (a completely
>>>>>>> new metadata type) is a lot of work from scratch, which is why it is not
>>>>>>> favored. However, my understanding is that as long as we re-use the view
>>>>>>> and table metadata, then the majority of the existing logic can be
>>>>>>> reused. I think what we have gone through in Slack to draft the rough
>>>>>>> Java API shape helps here, because people can estimate the amount of
>>>>>>> effort required to implement it. And I don't think they are
>>>>>>> **significantly** more complex to implement. Could you elaborate more
>>>>>>> about the complexity that you imagine?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>> daniel.c.we...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I feel I've been most vocal about pushing back against options 2+
>>>>>>>> (or Ryan's categories of combined table/view, or new metadata type), so
>>>>>>>> I'll try to expand on my reasoning.
>>>>>>>>
>>>>>>>> I understand the appeal of creating a design where we encapsulate
>>>>>>>> the view/storage from both a structural and performance standpoint,
>>>>>>>> but I don't think that is necessary and it significantly increases the
>>>>>>>> complexity.
>>>>>>>>
>>>>>>>> All of these approaches are aligned in one, specific way: the
>>>>>>>> storage table is an iceberg table.
>>>>>>>>
>>>>>>>> Because of this, all the behaviors and requirements still apply to
>>>>>>>> these tables.  They need to be maintained (snapshot cleanup, orphan
>>>>>>>> files), in some cases need to be optimized (compaction, manifest
>>>>>>>> rewrites), and they need to be able to be inspected (this will be even
>>>>>>>> more important with MV since staleness can produce different results
>>>>>>>> and questions will arise about what state the storage table was in).
>>>>>>>> There may be cases where the tables need to be managed directly.
>>>>>>>>
>>>>>>>> Anywhere we deviate from the existing constructs/commit/access for
>>>>>>>> tables, we will ultimately have to then unwrap to re-expose the
>>>>>>>> underlying Iceberg behavior.  This creates unnecessary complexity in
>>>>>>>> the library/API layer, which is not the primary interface users will
>>>>>>>> have with materialized views, where an engine is almost entirely
>>>>>>>> necessary to interact with the dataset.
>>>>>>>>
>>>>>>>> As to the performance concerns around option 1, I think we're
>>>>>>>> overstating the downsides.  It really comes down to how many metadata
>>>>>>>> loads are necessary, and evaluating freshness would likely be the real
>>>>>>>> bottleneck as it involves potentially loading many tables.  All of the
>>>>>>>> options are on the same order of performance for the metadata and table
>>>>>>>> loads.
>>>>>>>>
>>>>>>>> As to the visibility of tables and whether they're registered in
>>>>>>>> the catalog, I think registering in the catalog is the right approach
>>>>>>>> so that the tables are still addressable for maintenance/etc.  The
>>>>>>>> visibility of the storage table is a catalog implementation decision
>>>>>>>> and shouldn't be a requirement of the MV spec (I can see cases for both
>>>>>>>> and it isn't necessary to dictate a behavior).
>>>>>>>>
>>>>>>>> I'm still strongly in favor of Option 1 (separate table and view)
>>>>>>>> for these reasons.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>> and view (rather than a new metadata spec for a materialized view).
>>>>>>>>> What is the main motivation? It seems like you’re convinced of that
>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>
>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the options we
>>>>>>>>> have discussed so far. I wanted to use the existing Google Doc, but
>>>>>>>>> it has really bad table/sheet support...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>
>>>>>>>>> I have listed all the options, with how they are implemented and
>>>>>>>>> some important considerations we have discussed so far. Note that:
>>>>>>>>> 1. This sheet currently excludes the lineage information, which we
>>>>>>>>> can discuss more later after the current topic is resolved.
>>>>>>>>> 2. I removed the considerations for REST integration since from
>>>>>>>>> the other thread we have clarified that they should be considered
>>>>>>>>> completely separately.
>>>>>>>>>
>>>>>>>>> *Why I come as a proponent of having a new MV object with table
>>>>>>>>> and view metadata file pointer*
>>>>>>>>>
>>>>>>>>> In my sheet, there are 3 options that do not have major problems:
>>>>>>>>> Option 2: Add storage table metadata file pointer in view object
>>>>>>>>> Option 5: New MV object with table and view metadata file pointer
>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>
>>>>>>>>> I originally excluded option 2 because I thought it did not align
>>>>>>>>> with the REST spec, but after the other discussion thread about
>>>>>>>>> "Inconsistency between REST spec and table/view spec", I think my
>>>>>>>>> original concern no longer holds true, so now I put it back. And based
>>>>>>>>> on my personal preference that MV is an independent object that should
>>>>>>>>> be separated from view and table, plus the fact that option 5 is
>>>>>>>>> probably less work than option 6 for implementation, that is how I come
>>>>>>>>> as a proponent of option 5 at this moment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>
>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6 all
>>>>>>>>> under the same category of "A combination of a view and a table"
>>>>>>>>> and concludes that they don't have any advantage for the same set of
>>>>>>>>> reasons. But those reasons are not really convincing to me so let's
>>>>>>>>> talk about them in more detail.
>>>>>>>>>
>>>>>>>>> (1) You said "I don’t see a reason why a combined view and table
>>>>>>>>> is advantageous" as "this would cause unnecessary dependence between
>>>>>>>>> the view and table in catalogs."  What dependency exactly do you mean
>>>>>>>>> here? And why is that unnecessary, given there has to be some sort of
>>>>>>>>> dependency anyway unless we go with option 5 or 6?
>>>>>>>>>
>>>>>>>>> (2) You said "I guess there’s an argument that you could load both
>>>>>>>>> table and view metadata locations at the same time. That hardly seems
>>>>>>>>> worth the trouble". I disagree with that. Catalog interaction
>>>>>>>>> performance is critical to at least everyone working in EMR and Athena,
>>>>>>>>> and MV itself as an acceleration approach needs to be as fast as
>>>>>>>>> possible.
>>>>>>>>>
>>>>>>>>> I have put 3 key operations in the doc that I think matter for MV
>>>>>>>>> during interactions with an engine:
>>>>>>>>> 1. refresh the storage table
>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>
>>>>>>>>> And option 1 clearly falls short with 4 sequential steps required
>>>>>>>>> to load a storage table. You mentioned "recent issues with adding
>>>>>>>>> views to the JDBC catalog" in this topic. Could you explain a bit more?
>>>>>>>>>
>>>>>>>>> (3) You said "I also think that once we decide on structure, we
>>>>>>>>> can make it possible for REST catalog implementations to do smart
>>>>>>>>> things, in a way that doesn’t put additional requirements on the
>>>>>>>>> underlying catalog store." If REST is fully compatible with the Iceberg
>>>>>>>>> spec then I have no problem with this statement. However, as we
>>>>>>>>> discussed in the other thread, it is not the case. In the current
>>>>>>>>> state, I think the sequence of action should be to evolve the Iceberg
>>>>>>>>> table/view spec (or add a MV spec) first, and then think about how REST
>>>>>>>>> can incorporate it or do smart things that are not Iceberg spec
>>>>>>>>> compliant. Do you agree with that?
>>>>>>>>>
>>>>>>>>> (4) You said the table identifier pointer "is a problem we need to
>>>>>>>>> solve generally because a materialized table needs to be able to track
>>>>>>>>> the upstream state of tables that were used". I don't think that is a
>>>>>>>>> reason to choose to use a table identifier pointer for a storage table.
>>>>>>>>> The issue is not about using a table identifier pointer. It is about
>>>>>>>>> exposing the storage table as a separate entity in the catalog, which
>>>>>>>>> is what people do not like and is already discussed at length in Jan's
>>>>>>>>> question 3 (also linked in the sheet). I agree with that statement,
>>>>>>>>> because without a REST implementation that can magically hide the
>>>>>>>>> storage table, this model adds additional burden regarding compliance
>>>>>>>>> and data governance for any other non-REST catalog implementations
>>>>>>>>> that are compliant with the Iceberg spec. Many mechanisms need to be
>>>>>>>>> built in a catalog to hide, protect, maintain, and recycle the storage
>>>>>>>>> table, which can be avoided by using other approaches. I think we
>>>>>>>>> should reach a consensus about that and discuss further if you do not
>>>>>>>>> agree.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>>>>>>>>>> Your categories correspond to the following designs:
>>>>>>>>>>
>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>
>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so
>>>>>>>>>> I’ll be more specific:
>>>>>>>>>>
>>>>>>>>>>    - *Separate table and view*: this option is to have the
>>>>>>>>>>    objects that we have today, with extra metadata. Commit processes
>>>>>>>>>>    are separate: committing to the table doesn’t alter the view and
>>>>>>>>>>    committing to the view doesn’t change the table. However, changing
>>>>>>>>>>    the view can make it so the table is no longer useful as a
>>>>>>>>>>    materialization.
>>>>>>>>>>    - *A combination of a view and a table*: in this option, the
>>>>>>>>>>    table metadata and view metadata are the same as the first option.
>>>>>>>>>>    The difference is that the commit process combines them, either by
>>>>>>>>>>    embedding a table metadata location in view metadata or by tracking
>>>>>>>>>>    both in the same catalog reference.
>>>>>>>>>>    - *A new metadata type*: this option is where we define a new
>>>>>>>>>>    metadata object that has view attributes, like SQL representations,
>>>>>>>>>>    along with table attributes, like partition specs and snapshots.
>>>>>>>>>>
>>>>>>>>>> Hopefully this is clear because I think much of the confusion is
>>>>>>>>>> caused by different definitions.
>>>>>>>>>>
>>>>>>>>>> The LoadTableResponse having optional metadata-location field
>>>>>>>>>> implies that the object in the catalog no longer needs to hold a
>>>>>>>>>> metadata file pointer
>>>>>>>>>>
>>>>>>>>>> The REST protocol has not removed the requirement for a metadata
>>>>>>>>>> file, so I’m going to keep focused on the MV design options.
>>>>>>>>>>
>>>>>>>>>> When we say a MV can be a “new metadata type”, it does not mean
>>>>>>>>>> it needs to define a completely brand new structure of the metadata
>>>>>>>>>> content
>>>>>>>>>>
>>>>>>>>>> I’m making a distinction between separate metadata files for the
>>>>>>>>>> table and the view and a combined metadata object, as above.
>>>>>>>>>>
>>>>>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which
>>>>>>>>>> has 1 table metadata file pointer, and 1 view metadata file pointer
>>>>>>>>>>
>>>>>>>>>> This is the option I am referring to as a “combination of a view
>>>>>>>>>> and a table”.
>>>>>>>>>>
>>>>>>>>>> So to review my initial email, I don’t see a reason why a
>>>>>>>>>> combined view and table is advantageous, either implemented by having
>>>>>>>>>> a catalog reference with two metadata locations or embedding a table
>>>>>>>>>> metadata location in view metadata. This would cause unnecessary
>>>>>>>>>> dependence between the view and table in catalogs. I guess there’s an
>>>>>>>>>> argument that you could load both table and view metadata locations at
>>>>>>>>>> the same time. That hardly seems worth the trouble given the recent
>>>>>>>>>> issues with adding views to the JDBC catalog.
>>>>>>>>>>
>>>>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>>>>> possible for REST catalog implementations to do smart things, in a way
>>>>>>>>>> that doesn’t put additional requirements on the underlying catalog
>>>>>>>>>> store. For instance, we could specify how to send additional objects
>>>>>>>>>> in a LoadViewResult, in case the catalog wants to pre-fetch table
>>>>>>>>>> metadata. I think these optimizations are a later addition, after we
>>>>>>>>>> define the relationship between views and tables.
>>>>>>>>>>
>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table and
>>>>>>>>>> view (rather than a new metadata spec for a materialized view). What
>>>>>>>>>> is the main motivation? It seems like you’re convinced of that
>>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few minor
>>>>>>>>>>> points.
>>>>>>>>>>>
>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new
>>>>>>>>>>>> metadata type?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial
>>>>>>>>>>> proposal of a new Catalog MV object that has two references
>>>>>>>>>>> (ViewMetadata + TableMetadata).
>>>>>>>>>>>
>>>>>>>>>>>> The arguments that I see for a combined materialized view object
>>>>>>>>>>>> are:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Regular views are separate, rather than being tables with
>>>>>>>>>>>>    SQL and no data so it would be inconsistent (“Iceberg view is
>>>>>>>>>>>>    just a table with no data but with representations defined. But we
>>>>>>>>>>>>    did not do that.”)
>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>    - Tables are not typically exposed to end users, but this
>>>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>>>
>>>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>>>
>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says
>>>>>>>>>>>    it is a spec change (ie, to catalogs)
>>>>>>>>>>>    - A single call to get the View's StorageTable (versus two
>>>>>>>>>>>    calls)
>>>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Thoughts:* I think the long discussion sessions we had on
>>>>>>>>>>> Slack were fruitful for me, as seeing the API clarified some things.
>>>>>>>>>>>
>>>>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV operations
>>>>>>>>>>> end up being ViewCatalog or Catalog operations, I am starting to think
>>>>>>>>>>> API-wise that it may not align with the new metadata type (unless we
>>>>>>>>>>> define MVCatalog and /MV REST endpoints, which then are boilerplate
>>>>>>>>>>> wrappers).
>>>>>>>>>>>
>>>>>>>>>>> Initially one question I had for option 'a view and a separate
>>>>>>>>>>> table', was how to make this table reference (metadata.json or catalog
>>>>>>>>>>> reference).  In the previous option, we had a precedent of Catalog
>>>>>>>>>>> references to Metadata, but not pointers between Metadatas.  I
>>>>>>>>>>> initially saw the proposed Catalog's TableIdentifier pointer as
>>>>>>>>>>> 'polluting' catalog concerns in ViewMetadata.  (I saw Catalog and
>>>>>>>>>>> ViewCatalog as a layer above TableMetadata and ViewMetadata).  But I
>>>>>>>>>>> think Dan in the Slack made a fair point that ViewMetadata already is
>>>>>>>>>>> tightly bound with a Catalog.  In this case, I think this approach
>>>>>>>>>>> does have its merits as well in aligning Catalog APIs with the
>>>>>>>>>>> metadata.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to provide my perspective on the question of what
>>>>>>>>>>>> a materialized view is and elaborate on Jack's recent proposal to
>>>>>>>>>>>> view a materialized view as a catalog concept.
>>>>>>>>>>>>
>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in
>>>>>>>>>>>> the catalog has a *unique identifier*, and the catalog
>>>>>>>>>>>> provides methods to create, load, and update these entities. An
>>>>>>>>>>>> important thing to note is that the catalog methods exhibit two
>>>>>>>>>>>> different behaviors: the *create and load methods deal with the
>>>>>>>>>>>> entire entity*, while the *update(commit) method only deals with
>>>>>>>>>>>> partial changes* to the entities.
>>>>>>>>>>>>
>>>>>>>>>>>> In the context of our current discussion, materialized view
>>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact that
>>>>>>>>>>>> the update method deals only with partial changes enables us to
>>>>>>>>>>>> *reuse the existing methods for updating tables and views*. For
>>>>>>>>>>>> updates we don't have to define what constitutes an entire
>>>>>>>>>>>> materialized view. Changes to a materialized view targeting the
>>>>>>>>>>>> properties related to the view metadata could use the update(commit)
>>>>>>>>>>>> view method. Similarly, changes targeting the properties related to
>>>>>>>>>>>> the table metadata could use the update(commit) table method. This
>>>>>>>>>>>> is great news because we don't have to redefine view and table
>>>>>>>>>>>> commits (requirements, updates).
>>>>>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>>>>>> update the storage table for Option 1 and 3:
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>
>>>>>>>>>>>> The open question is *whether the create and load methods
>>>>>>>>>>>> should treat the properties that constitute the MV metadata as two
>>>>>>>>>>>> entities (View + Table) or one entity (new MV object)*. This is all
>>>>>>>>>>>> part of Jack's proposal, where Option 1 proposes a new MV object,
>>>>>>>>>>>> and Option 3 proposes two separate entities. The advantage of Option
>>>>>>>>>>>> 1 is that it doesn't require two operations to load the metadata. On
>>>>>>>>>>>> the other hand, the advantage of Option 3 is that no new operations
>>>>>>>>>>>> or catalogs have to be defined.
>>>>>>>>>>>>
>>>>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see
>>>>>>>>>>>> a path where we could first introduce Option 3 and still have the
>>>>>>>>>>>> possibility to transition to Option 1 if needed. The great thing
>>>>>>>>>>>> about Option 3 is that it only requires minor changes to the current
>>>>>>>>>>>> spec and is mostly implementation detail.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3
>>>>>>>>>>>> that only introduce changes to the spec that are not specific to
>>>>>>>>>>>> materialized views. The idea is to introduce boolean properties to
>>>>>>>>>>>> be set on the creation of the view and the storage table that
>>>>>>>>>>>> indicate that they belong to a materialized view. The view property
>>>>>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>>>>>> view. And the table property "storage_table" is set to "true" for a
>>>>>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>>>>>> properties indicates a regular view or table.
>>>>>>>>>>>>
>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>
>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>
>>>>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>>>>> called "AssertProperty" which could make sure to only perform
>>>>>>>>>>>> updates that are in line with materialized views. The additional
>>>>>>>>>>>> requirement can be seen as a general extension which does not need
>>>>>>>>>>>> to be changed if we decide to go with Option 1 in the future.
>>>>>>>>>>>>
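One possible shape for such a requirement, as a rough sketch only: the
class below is hypothetical and not part of the Iceberg API or the REST
spec. It assumes the properties are available as a plain string map and
treats an absent property as "false", following the rule that absence
indicates a regular view or table.

```java
import java.util.Map;

public class AssertProperty {
  private final String key;
  private final String expected;

  public AssertProperty(String key, String expected) {
    this.key = key;
    this.expected = expected;
  }

  /** Throws if the property does not hold the expected value. */
  public void validate(Map<String, String> properties) {
    // An absent property is read as "false" (regular view or table).
    String actual = properties.getOrDefault(key, "false");
    if (!expected.equals(actual)) {
      throw new IllegalStateException(
          "Requirement failed: " + key + " expected " + expected + " but was " + actual);
    }
  }
}
```

A commit that should only ever touch a storage table could then run
something like new AssertProperty("storage_table", "true") against the
table properties before applying its updates, and reject the commit if the
check throws.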
>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>
>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>
>>>>>>>>>>>> Jan
>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>>>> metadata definitions and minimizing spec changes are very important.
>>>>>>>>>>>> This also minimizes spec drift (between materialized views and views
>>>>>>>>>>>> spec, and between materialized views and tables spec), and simplifies
>>>>>>>>>>>> the implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> In an effort to take the discussion forward with concrete
>>>>>>>>>>>> design options based on an end-to-end implementation, I have
>>>>>>>>>>>> prototyped the implementation (and added Spark support) in this PR
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps
>>>>>>>>>>>> us reach convergence faster. More details about some of the design
>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>> combined through a commit process. For instance, keeping a pointer
>>>>>>>>>>>>> to a table metadata file in a view metadata file or combining
>>>>>>>>>>>>> commits to reference both. I don't see the value in either option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question!
>>>>>>>>>>>>>> Just a clarification question regarding your reply before I reply
>>>>>>>>>>>>>> further: what exactly does the option "a combination of the two
>>>>>>>>>>>>>> (i.e. commits are combined)" mean? How is that different from "a
>>>>>>>>>>>>>> new metadata type"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can
>>>>>>>>>>>>>>> bring a fresh perspective.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jack already pointed out that we need to start from the
>>>>>>>>>>>>>>> basics and I agree with that. Let’s remove voting at this point.
>>>>>>>>>>>>>>> Right now is the time for discussing trade-offs, not lining up
>>>>>>>>>>>>>>> and taking sides. I realize that wasn’t the intent with adding a
>>>>>>>>>>>>>>> vote, but that’s almost always the result. It’s too easy to use
>>>>>>>>>>>>>>> it as a stand-in for consensus and move on prematurely. I get the
>>>>>>>>>>>>>>> impression from the swirl in Slack that discussion has moved
>>>>>>>>>>>>>>> ahead of agreement.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We’re still at the most basic question: is a materialized
>>>>>>>>>>>>>>> view a view and a separate table, a combination of the two (i.e.
>>>>>>>>>>>>>>> commits are combined), or a new metadata type?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some
>>>>>>>>>>>>>>> kind of “system table” (meaning hidden?) or if it is exposed in
>>>>>>>>>>>>>>> the catalog. That’s a later choice (already pointed out) and, I
>>>>>>>>>>>>>>> suspect, it should be delegated to catalog implementations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To simplify this a little, I think that we can eliminate the
>>>>>>>>>>>>>>> option to combine table and view commits. I don’t think there is
>>>>>>>>>>>>>>> a reason to combine the two. If separate, a table would track the
>>>>>>>>>>>>>>> view version used along with freshness information for referenced
>>>>>>>>>>>>>>> tables. If the table is automatically skipped when the version no
>>>>>>>>>>>>>>> longer matches the view, then no action needs to happen when a
>>>>>>>>>>>>>>> view definition changes. Similarly, the table can be updated
>>>>>>>>>>>>>>> independentl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
