Re: Materialized view integration with REST spec

Jean-Baptiste Onofré Fri, 22 Mar 2024 01:44:15 -0700

Hi Renjie,

We discussed the MV proposal, without yet reaching any conclusion.


I propose:
- to use the "new" proposal process now in place (creating a GH issue with the
proposal flag and a link to the document)
- use the document and/or GH issue to add comments
- finalize the document heading to a vote (to get consensus)

Thoughts ?

NB: I will follow up with a "stale PR/proposal" PR to be sure we are moving
forward ;)

Regards
JB

On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
>
> Sorry I couldn't make it to the last community sync. Did we reach any
> conclusion about the MV spec?
>
> On Tue, Mar 5, 2024 at 11:28 PM himadri pal <meh...@gmail.com> wrote:
>
>> For me the calendar link did not work on mobile, but I was able to add
>> the dev Google calendar from
>> https://iceberg.apache.org/community/#iceberg-community-events by
>> accessing it from a laptop.
>>
>> Regards,
>> Himadri Pal
>>
>>
>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thanks Jack! I think the images are stripped from the message, but they
>>> are there on the doc
>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>> if someone wants to check them out (I have left some comments while there).
>>>
>>> Also I no longer see the community sync calendar
>>> https://iceberg.apache.org/community/#slack, so it is unclear when the
>>> meeting is (and we do not have the link).
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Thanks Jan! +1 for everyone to take a look before the discussion, and
>>>> see if there are any missing options or major arguments.
>>>>
>>>> I have also added images for all the options, which might be easier to
>>>> parse than the big sheet. I will also put them here for people who do not
>>>> have time to read through it:
>>>>
>>>>
>>>> *Option 1: Add storage table identifier in view metadata content*
>>>>
>>>> [image: MV option 1.png]
>>>> *Option 2: Add storage table metadata file pointer in view object*
>>>>
>>>> [image: MV option 2.png]
>>>> *Option 3: Add storage table metadata file pointer in view metadata
>>>> content*
>>>>
>>>> [image: MV option 3.png]
>>>>
>>>> *Option 4: Embed table metadata in view metadata content*
>>>>
>>>> [image: MV option 4.png]
>>>> *Option 5: New MV spec, MV object has table and view metadata file
>>>> pointers*
>>>>
>>>> [image: MV option 5.png]
>>>> *Option 6: New MV spec, MV metadata content embeds table and view
>>>> metadata*
>>>>
>>>> [image: MV option 6.png]
>>>> *Option 7: New MV spec, completely new MV metadata content*
>>>>
>>>> [image: MV option 7.png]
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <jank...@mailbox.org.invalid>
>>>> wrote:
>>>>
>>>>> I think it's great to have a face to face discussion about this.
>>>>> Additionally, I would propose to use Jack's document
>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>> as a common ground for the discussion and that everyone has a quick look
>>>>> before the next community sync. If you think the document is still missing
>>>>> some arguments, please make suggestions to add them. This way we have to
>>>>> spend less time to get everyone up to speed and have a more common
>>>>> terminology.
>>>>>
>>>>> Looking forward to the discussion, best wishes
>>>>>
>>>>> Jan
>>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>>
>>>>> The calendar on the site is currently broken
>>>>> https://iceberg.apache.org/community/#iceberg-community-events. Might
>>>>> help to fix it or share the meeting link here.
>>>>>
>>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Sounds good, let's discuss this in person!
>>>>>>
>>>>>> I am a bit worried that we have quite a few critical topics going on
>>>>>> right now on devlist, and this will take up a lot of time to discuss. If
>>>>>> it ends up going for too long, I propose that we have a dedicated
>>>>>> meeting, and I am more than happy to organize it.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I think this thread has hit a point of diminishing returns and that
>>>>>>> we still don't have a common understanding of what the options under
>>>>>>> consideration actually are.
>>>>>>>
>>>>>>> Since we were already planning on discussing this at the next
>>>>>>> community sync, I suggest we pick this up there and use that time to
>>>>>>> align on what exactly we're considering. We can then start a new thread
>>>>>>> to lay out the designs under consideration in more detail and then have
>>>>>>> a discussion about trade-offs.
>>>>>>>
>>>>>>> Does that sound reasonable?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>>>> also suggest breaking the expectation/outcome into milestones. Maybe it
>>>>>>>> becomes easier if we agree to distinguish between an approach that is
>>>>>>>> feasible in the near term and another in the long term, especially if
>>>>>>>> the latter requires significant engine-side changes.
>>>>>>>>
>>>>>>>> Further, maybe it helps if we start with an option that fully
>>>>>>>> reuses the existing spec, and see how we view it in comparison with the
>>>>>>>> options discussed previously. I am sharing one below. It reuses the
>>>>>>>> current spec of Iceberg views and tables by leveraging table properties
>>>>>>>> to capture materialized view metadata. What is common (and not common)
>>>>>>>> between this and the desired representations?
>>>>>>>>
>>>>>>>> The new properties are:
>>>>>>>>
>>>>>>>> Properties on a View:
>>>>>>>>
>>>>>>>>    1. *iceberg.materialized.view*:
>>>>>>>>       - *Type*: View property
>>>>>>>>       - *Purpose*: This property is used to mark whether a view is
>>>>>>>>       a materialized view. If set to true, the view is treated as
>>>>>>>>       a materialized view. This helps in differentiating between
>>>>>>>>       virtual and materialized views within the catalog and dictates
>>>>>>>>       specific handling and validation logic for materialized views.
>>>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>>>       - *Type*: View property
>>>>>>>>       - *Purpose*: Specifies the location of the storage table
>>>>>>>>       associated with the materialized view. This property is used for
>>>>>>>>       linking a materialized view with its corresponding storage table,
>>>>>>>>       enabling data management and query execution based on the stored
>>>>>>>>       data freshness.
>>>>>>>>
>>>>>>>> Properties on a Table:
>>>>>>>>
>>>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>>>       - *Type*: Table property
>>>>>>>>       - *Purpose*: These properties store the snapshot IDs of the
>>>>>>>>       base tables at the time the materialized view's data was last
>>>>>>>>       updated. Each property is prefixed with base.snapshot. followed by
>>>>>>>>       the UUID of the base table. They are used to track whether the
>>>>>>>>       materialized view's data is up to date with the base tables by
>>>>>>>>       comparing these snapshot IDs with the current snapshot IDs of the
>>>>>>>>       base tables. If all the base tables' current snapshot IDs match
>>>>>>>>       the ones stored in these properties, the materialized view's data
>>>>>>>>       is considered fresh.
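
As an illustration only, the freshness rule described above (compare the snapshot IDs recorded under base.snapshot.[UUID] with the base tables' current snapshot IDs) could be sketched in plain Java as follows. The names MvFreshness and isFresh are hypothetical, the maps stand in for real table properties and catalog lookups, and treating a storage table with no recorded base snapshots as stale is an assumption that the proposal itself does not state.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Iceberg API.
final class MvFreshness {
    static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

    // storageTableProps: properties of the storage table.
    // currentSnapshotIds: base table UUID -> current snapshot ID
    // (assumed to have been looked up from the catalog elsewhere).
    static boolean isFresh(Map<String, String> storageTableProps,
                           Map<String, Long> currentSnapshotIds) {
        boolean sawBaseTable = false;
        for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
            String key = entry.getKey();
            if (!key.startsWith(BASE_SNAPSHOT_PREFIX)) {
                continue; // unrelated table property
            }
            sawBaseTable = true;
            String baseTableUuid = key.substring(BASE_SNAPSHOT_PREFIX.length());
            long recorded = Long.parseLong(entry.getValue());
            Long current = currentSnapshotIds.get(baseTableUuid);
            if (current == null || current.longValue() != recorded) {
                return false; // base table is unknown or has moved on
            }
        }
        // Assumption: no recorded base snapshots means "not known to be fresh".
        return sawBaseTable;
    }
}
```

The check is fresh only if every recorded base table still points at the recorded snapshot, which mirrors the "all base tables match" wording of the property description.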
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > All of these approaches are aligned in one, specific way: the
>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>
>>>>>>>>> I do not think that is true. I think people are aligned that we
>>>>>>>>> would like to re-use the Iceberg table metadata defined in the Iceberg
>>>>>>>>> table spec to express the data in MV, but I don't think it goes so far
>>>>>>>>> as to say it must be an Iceberg table. Once you have that mindset, then
>>>>>>>>> of course option 1 (separate table and view) is the only option.
>>>>>>>>>
>>>>>>>>> > I don't think that is necessary and it significantly increases
>>>>>>>>> the complexity.
>>>>>>>>>
>>>>>>>>> And can you quantify what you mean by "significantly increases the
>>>>>>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff
>>>>>>>>> with complexity. We probably all agree that using option 7 (a
>>>>>>>>> completely new metadata type) is a lot of work from scratch, that is
>>>>>>>>> why it is not favored. However, my understanding is that as long as we
>>>>>>>>> re-use the view and table metadata, then the majority of the existing
>>>>>>>>> logic can be reused. I think what we have gone through in Slack to
>>>>>>>>> draft the rough Java API shape helps here, because people can estimate
>>>>>>>>> the amount of effort required to implement it. And I don't think they
>>>>>>>>> are **significantly** more complex to implement. Could you elaborate
>>>>>>>>> more about the complexity that you imagine?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>>>> daniel.c.we...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I feel I've been most vocal about pushing back against options 2+
>>>>>>>>>> (or Ryan's categories of combined table/view, or new metadata type),
>>>>>>>>>> so I'll try to expand on my reasoning.
>>>>>>>>>>
>>>>>>>>>> I understand the appeal of creating a design where we encapsulate
>>>>>>>>>> the view/storage from both a structural and performance standpoint,
>>>>>>>>>> but I don't think that is necessary and it significantly increases the
>>>>>>>>>> complexity.
>>>>>>>>>>
>>>>>>>>>> All of these approaches are aligned in one, specific way: the
>>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>>
>>>>>>>>>> Because of this, all the behaviors and requirements still apply
>>>>>>>>>> to these tables.  They need to be maintained (snapshot cleanup, orphan
>>>>>>>>>> files), in some cases need to be optimized (compaction, manifest
>>>>>>>>>> rewrites), and they need to be able to be inspected (this will be even
>>>>>>>>>> more important with MV since staleness can produce different results
>>>>>>>>>> and questions will arise about what state the storage table was in).
>>>>>>>>>> There may be cases where the tables need to be managed directly.
>>>>>>>>>>
>>>>>>>>>> Anywhere we deviate from the existing constructs/commit/access
>>>>>>>>>> for tables, we will ultimately have to then unwrap to re-expose the
>>>>>>>>>> underlying Iceberg behavior.  This creates unnecessary complexity in
>>>>>>>>>> the library/API layer, which is not the primary interface users will
>>>>>>>>>> have with materialized views, where an engine is almost entirely
>>>>>>>>>> necessary to interact with the dataset.
>>>>>>>>>>
>>>>>>>>>> As to the performance concerns around option 1, I think we're
>>>>>>>>>> overstating the downsides.  It really comes down to how many metadata
>>>>>>>>>> loads are necessary and evaluating freshness would likely be the real
>>>>>>>>>> bottleneck as it involves potentially loading many tables.  All of the
>>>>>>>>>> options are on the same order of performance for the metadata and
>>>>>>>>>> table loads.
>>>>>>>>>>
>>>>>>>>>> As to the visibility of tables and whether they're registered in
>>>>>>>>>> the catalog, I think registering in the catalog is the right approach
>>>>>>>>>> so that the tables are still addressable for maintenance/etc.  The
>>>>>>>>>> visibility of the storage table is a catalog implementation decision
>>>>>>>>>> and shouldn't be a requirement of the MV spec (I can see cases for
>>>>>>>>>> both and it isn't necessary to dictate a behavior).
>>>>>>>>>>
>>>>>>>>>> I'm still strongly in favor of Option 1 (separate table and view)
>>>>>>>>>> for these reasons.
>>>>>>>>>>
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>>> and view (rather than a new metadata spec for a materialized view).
>>>>>>>>>>> What is the main motivation? It seems like you’re convinced of that
>>>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the options
>>>>>>>>>>> we have discussed so far. I wanted to use the existing Google Doc,
>>>>>>>>>>> but it has really bad table/sheet support...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>>>
>>>>>>>>>>> I have listed all the options, with how they are implemented and
>>>>>>>>>>> some important considerations we have discussed so far. Note that:
>>>>>>>>>>> 1. This sheet currently excludes the lineage information, which
>>>>>>>>>>> we can discuss more later after the current topic is resolved.
>>>>>>>>>>> 2. I removed the considerations for REST integration since from
>>>>>>>>>>> the other thread we have clarified that they should be considered
>>>>>>>>>>> completely separately.
>>>>>>>>>>>
>>>>>>>>>>> *Why I come as a proponent of having a new MV object with table
>>>>>>>>>>> and view metadata file pointer*
>>>>>>>>>>>
>>>>>>>>>>> In my sheet, there are 3 options that do not have major problems:
>>>>>>>>>>> Option 2: Add storage table metadata file pointer in view object
>>>>>>>>>>> Option 5: New MV object with table and view metadata file
>>>>>>>>>>> pointer
>>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>>
>>>>>>>>>>> I originally excluded option 2 because I think it does not align
>>>>>>>>>>> with the REST spec, but after the other discussion thread about
>>>>>>>>>>> "Inconsistency between REST spec and table/view spec", I think my
>>>>>>>>>>> original concern no longer holds true so now I put it back. And based
>>>>>>>>>>> on my personal preference that MV is an independent object that
>>>>>>>>>>> should be separated from view and table, plus the fact that option 5
>>>>>>>>>>> is probably less work than option 6 for implementation, that is how I
>>>>>>>>>>> come as a proponent of option 5 at this moment.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>>>
>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6
>>>>>>>>>>> all under the same category of "A combination of a view and a
>>>>>>>>>>> table" and concludes that they don't have any advantage for the same
>>>>>>>>>>> set of reasons. But those reasons are not really convincing to me so
>>>>>>>>>>> let's talk about them in more detail.
>>>>>>>>>>>
>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view and
>>>>>>>>>>> table is advantageous" as "this would cause unnecessary dependence
>>>>>>>>>>> between the view and table in catalogs."  What dependency exactly do
>>>>>>>>>>> you mean here? And why is that unnecessary, given there has to be
>>>>>>>>>>> some sort of dependency anyway unless we go with option 5 or 6?
>>>>>>>>>>>
>>>>>>>>>>> (2) You said "I guess there’s an argument that you could load
>>>>>>>>>>> both table and view metadata locations at the same time. That hardly
>>>>>>>>>>> seems worth the trouble". I disagree with that. Catalog interaction
>>>>>>>>>>> performance is critical to at least everyone working in EMR and
>>>>>>>>>>> Athena, and MV itself as an acceleration approach needs to be as fast
>>>>>>>>>>> as possible.
>>>>>>>>>>>
>>>>>>>>>>> I have put 3 key operations in the doc that I think matter for
>>>>>>>>>>> MV during interactions with the engine:
>>>>>>>>>>> 1. refresh the storage table
>>>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>>>
>>>>>>>>>>> And option 1 clearly falls short with 4 sequential steps
>>>>>>>>>>> required to load a storage table. You mentioned "recent issues with
>>>>>>>>>>> adding views to the JDBC catalog" in this topic, could you explain a
>>>>>>>>>>> bit more?
>>>>>>>>>>>
>>>>>>>>>>> (3) You said "I also think that once we decide on structure, we
>>>>>>>>>>> can make it possible for REST catalog implementations to do smart
>>>>>>>>>>> things, in a way that doesn’t put additional requirements on the
>>>>>>>>>>> underlying catalog store." If REST is fully compatible with Iceberg
>>>>>>>>>>> spec then I have no problem with this statement. However, as we
>>>>>>>>>>> discussed in the other thread, it is not the case. In the current
>>>>>>>>>>> state, I think the sequence of action should be to evolve the Iceberg
>>>>>>>>>>> table/view spec (or add a MV spec) first, and then think about how
>>>>>>>>>>> REST can incorporate it or do smart things that are not Iceberg spec
>>>>>>>>>>> compliant. Do you agree with that?
>>>>>>>>>>>
>>>>>>>>>>> (4) You said the table identifier pointer "is a problem we need
>>>>>>>>>>> to solve generally because a materialized table needs to be able to
>>>>>>>>>>> track the upstream state of tables that were used". I don't think
>>>>>>>>>>> that is a reason to choose to use a table identifier pointer for a
>>>>>>>>>>> storage table. The issue is not about using a table identifier
>>>>>>>>>>> pointer. It is about exposing the storage table as a separate entity
>>>>>>>>>>> in the catalog, which is what people do not like and is already
>>>>>>>>>>> discussed at length in Jan's question 3 (also linked in the sheet). I
>>>>>>>>>>> agree with that statement, because without a REST implementation that
>>>>>>>>>>> can magically hide the storage table, this model adds additional
>>>>>>>>>>> burden regarding compliance and data governance for any other
>>>>>>>>>>> non-REST catalog implementations that are compliant with the Iceberg
>>>>>>>>>>> spec. Many mechanisms need to be built in a catalog to hide, protect,
>>>>>>>>>>> maintain, and recycle the storage table, which can be avoided by
>>>>>>>>>>> using other approaches. I think we should reach a consensus about
>>>>>>>>>>> that and discuss further if you do not agree.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>>>>>>>>>>>> Your categories correspond to the following designs:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>>>
>>>>>>>>>>>> Jan
>>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories,
>>>>>>>>>>>> so I’ll be more specific:
>>>>>>>>>>>>
>>>>>>>>>>>>    - *Separate table and view*: this option is to have the
>>>>>>>>>>>>    objects that we have today, with extra metadata. Commit processes
>>>>>>>>>>>>    are separate: committing to the table doesn’t alter the view and
>>>>>>>>>>>>    committing to the view doesn’t change the table. However, changing
>>>>>>>>>>>>    the view can make it so the table is no longer useful as a
>>>>>>>>>>>>    materialization.
>>>>>>>>>>>>    - *A combination of a view and a table*: in this option,
>>>>>>>>>>>>    the table metadata and view metadata are the same as the first
>>>>>>>>>>>>    option. The difference is that the commit process combines them,
>>>>>>>>>>>>    either by embedding a table metadata location in view metadata or
>>>>>>>>>>>>    by tracking both in the same catalog reference.
>>>>>>>>>>>>    - *A new metadata type*: this option is where we define a
>>>>>>>>>>>>    new metadata object that has view attributes, like SQL
>>>>>>>>>>>>    representations, along with table attributes, like partition specs
>>>>>>>>>>>>    and snapshots.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this is clear because I think much of the confusion
>>>>>>>>>>>> is caused by different definitions.
>>>>>>>>>>>>
>>>>>>>>>>>> > The LoadTableResponse having optional metadata-location field
>>>>>>>>>>>> implies that the object in the catalog no longer needs to hold a
>>>>>>>>>>>> metadata file pointer
>>>>>>>>>>>>
>>>>>>>>>>>> The REST protocol has not removed the requirement for a
>>>>>>>>>>>> metadata file, so I’m going to keep focused on the MV design options.
>>>>>>>>>>>>
>>>>>>>>>>>> > When we say a MV can be a “new metadata type”, it does not mean
>>>>>>>>>>>> it needs to define a completely brand new structure of the metadata
>>>>>>>>>>>> content
>>>>>>>>>>>>
>>>>>>>>>>>> I’m making a distinction between separate metadata files for
>>>>>>>>>>>> the table and the view and a combined metadata object, as above.
>>>>>>>>>>>>
>>>>>>>>>>>> > We can define an “Iceberg MV” to be an object in a catalog,
>>>>>>>>>>>> which has 1 table metadata file pointer, and 1 view metadata file
>>>>>>>>>>>> pointer
>>>>>>>>>>>>
>>>>>>>>>>>> This is the option I am referring to as a “combination of a
>>>>>>>>>>>> view and a table”.
>>>>>>>>>>>>
>>>>>>>>>>>> So to review my initial email, I don’t see a reason why a
>>>>>>>>>>>> combined view and table is advantageous, either implemented by having
>>>>>>>>>>>> a catalog reference with two metadata locations or embedding a table
>>>>>>>>>>>> metadata location in view metadata. This would cause unnecessary
>>>>>>>>>>>> dependence between the view and table in catalogs. I guess there’s an
>>>>>>>>>>>> argument that you could load both table and view metadata locations
>>>>>>>>>>>> at the same time. That hardly seems worth the trouble given the
>>>>>>>>>>>> recent issues with adding views to the JDBC catalog.
>>>>>>>>>>>>
>>>>>>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>>>>>>> possible for REST catalog implementations to do smart things, in a
>>>>>>>>>>>> way that doesn’t put additional requirements on the underlying
>>>>>>>>>>>> catalog store. For instance, we could specify how to send additional
>>>>>>>>>>>> objects in a LoadViewResult, in case the catalog wants to pre-fetch
>>>>>>>>>>>> table metadata. I think these optimizations are a later addition,
>>>>>>>>>>>> after we define the relationship between views and tables.
>>>>>>>>>>>>
>>>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>>>> and view (rather than a new metadata spec for a materialized view).
>>>>>>>>>>>> What is the main motivation? It seems like you’re convinced of that
>>>>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few
>>>>>>>>>>>>> minor points.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new
>>>>>>>>>>>>>> metadata type?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial
>>>>>>>>>>>>> proposal of a new Catalog MV object that has two references 
>>>>>>>>>>>>> (ViewMetadata +
>>>>>>>>>>>>> TableMetadata).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Regular views are separate, rather than being tables
>>>>>>>>>>>>>>    with SQL and no data so it would be inconsistent (“Iceberg view
>>>>>>>>>>>>>>    is just a table with no data but with representations defined.
>>>>>>>>>>>>>>    But we did not do that.”)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Tables are not typically exposed to end users — but
>>>>>>>>>>>>>>    this isn’t required by the separate view and table option
>>>>>>>>>>>>>>
>>>>>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says
>>>>>>>>>>>>>    it is a spec change (ie, to catalogs)
>>>>>>>>>>>>>    - A single call to get the View's StorageTable (versus two
>>>>>>>>>>>>>    calls)
>>>>>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Thoughts:* I think the long discussion sessions we had on
>>>>>>>>>>>>> Slack were fruitful for me, as seeing the API clarified some things.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV
>>>>>>>>>>>>> operations end up being ViewCatalog or Catalog operations, I am
>>>>>>>>>>>>> starting to think API-wise that it may not align with the new
>>>>>>>>>>>>> metadata type (unless we define MVCatalog and /MV REST endpoints,
>>>>>>>>>>>>> which then are boilerplate wrappers).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Initially one question I had for option 'a view and a separate
>>>>>>>>>>>>> table' was how to make this table reference (metadata.json or
>>>>>>>>>>>>> catalog reference).  In the previous option, we had a precedent of
>>>>>>>>>>>>> Catalog references to Metadata, but not pointers between Metadatas.
>>>>>>>>>>>>> I initially saw the proposed Catalog's TableIdentifier pointer as
>>>>>>>>>>>>> 'polluting' catalog concerns in ViewMetadata.  (I saw Catalog and
>>>>>>>>>>>>> ViewCatalog as a layer above TableMetadata and ViewMetadata).  But I
>>>>>>>>>>>>> think Dan in the Slack made a fair point that ViewMetadata already
>>>>>>>>>>>>> is tightly bound with a Catalog.  In this case, I think this
>>>>>>>>>>>>> approach does have its merits as well in aligning Catalog APIs with
>>>>>>>>>>>>> the metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to provide my perspective on the question of
>>>>>>>>>>>>>> what a materialized view is and elaborate on Jack's recent proposal
>>>>>>>>>>>>>> to view a materialized view as a catalog concept.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity
>>>>>>>>>>>>>> in the catalog has a *unique identifier*, and the catalog
>>>>>>>>>>>>>> provides methods to create, load, and update these entities. An
>>>>>>>>>>>>>> important thing to note is that the catalog methods exhibit two
>>>>>>>>>>>>>> different behaviors: the *create and load methods deal with the
>>>>>>>>>>>>>> entire entity*, while the *update(commit) method only deals with
>>>>>>>>>>>>>> partial changes* to the entities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the context of our current discussion, materialized view
>>>>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact that
>>>>>>>>>>>>>> the update method deals only with partial changes enables us to
>>>>>>>>>>>>>> *reuse the existing methods for updating tables and views*. For
>>>>>>>>>>>>>> updates we don't have to define what constitutes an entire
>>>>>>>>>>>>>> materialized view. Changes to a materialized view targeting the
>>>>>>>>>>>>>> properties related to the view metadata could use the
>>>>>>>>>>>>>> update(commit) view method. Similarly, changes targeting the
>>>>>>>>>>>>>> properties related to the table metadata could use the
>>>>>>>>>>>>>> update(commit) table method. This is great news because we don't
>>>>>>>>>>>>>> have to redefine view and table commits (requirements, updates).
>>>>>>>>>>>>>> This is shown in the fact that Jack uses the same operation
>>>>>>>>>>>>>> to update the storage table for Option 1 and 3:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The open question is *whether the create and load methods
>>>>>>>>>>>>>> should treat the properties that constitute the MV metadata as two
>>>>>>>>>>>>>> entities (View + Table) or one entity (new MV object)*. This is all
>>>>>>>>>>>>>> part of Jack's proposal, where Option 1 proposes a new MV object,
>>>>>>>>>>>>>> and Option 3 proposes two separate entities. The advantage of
>>>>>>>>>>>>>> Option 1 is that it doesn't require two operations to load the
>>>>>>>>>>>>>> metadata. On the other hand, the advantage of Option 3 is that no
>>>>>>>>>>>>>> new operations or catalogs have to be defined.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see
>>>>>>>>>>>>>> a path where we could first introduce Option 3 and still have the
>>>>>>>>>>>>>> possibility to transition to Option 1 if needed. The great thing
>>>>>>>>>>>>>> about Option 3 is that it only requires minor changes to the
>>>>>>>>>>>>>> current spec and is mostly implementation detail.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3
>>>>>>>>>>>>>> that only introduce changes to the spec that are not specific to
>>>>>>>>>>>>>> materialized views. The idea is to introduce boolean properties to
>>>>>>>>>>>>>> be set on the creation of the view and the storage table that
>>>>>>>>>>>>>> indicate that they belong to a materialized view. The view property
>>>>>>>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>>>>>>>> view. And the table property "storage_table" is set to "true" for a
>>>>>>>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>>>>>>>> properties indicates a regular view or table.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We could then introduce a new requirement for views and
>>>>>>>>>>>>>> tables called "AssertProperty" which could make sure to only
>>>>>>>>>>>>>> perform updates that are in line with materialized views. The
>>>>>>>>>>>>>> additional requirement can be seen as a general extension which
>>>>>>>>>>>>>> does not need to be changed if we decide to go with Option 1 in
>>>>>>>>>>>>>> the future.
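
A minimal sketch of how the "materialized" / "storage_table" flags and an "AssertProperty"-style requirement might behave, written in plain Java. The class and method names below are hypothetical and are not part of the Iceberg API. Properties are modeled as a simple string map, and a missing property is read as "false", following the rule that absence indicates a regular view or table.

```java
import java.util.Map;

// Hypothetical illustration only: none of these names exist in Iceberg.
public final class AssertPropertySketch {

    private AssertPropertySketch() {}

    // Reads a boolean flag from a property map.
    // A missing key is treated as false (regular view or table).
    public static boolean flag(Map<String, String> properties, String key) {
        return Boolean.parseBoolean(properties.get(key));
    }

    // Fails if the flag does not have the expected value, so that an
    // update meant for a storage table is not applied to a regular table.
    public static void assertProperty(
            Map<String, String> properties, String key, boolean expected) {
        boolean actual = flag(properties, key);
        if (actual != expected) {
            throw new IllegalStateException(
                "Requirement failed: property '" + key + "' is " + actual
                    + ", expected " + expected);
        }
    }

    public static void main(String[] args) {
        Map<String, String> storageTableProps = Map.of("storage_table", "true");
        Map<String, String> regularTableProps = Map.of();

        // Passes: the table is marked as a storage table.
        assertProperty(storageTableProps, "storage_table", true);

        // Fails: the property is absent, so this is a regular table.
        try {
            assertProperty(regularTableProps, "storage_table", true);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the check only compares a property key with an expected value, the same requirement could be reused for the view-side "materialized" flag.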
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>>>>>> metadata definitions and minimizing spec changes are very
>>>>>>>>>>>>>> important. This also minimizes spec drift (between materialized
>>>>>>>>>>>>>> views and views spec, and between materialized views and tables
>>>>>>>>>>>>>> spec), and simplifies the implementation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In an effort to take the discussion forward with concrete
>>>>>>>>>>>>>> design options based on an end-to-end implementation, I have
>>>>>>>>>>>>>> prototyped the implementation (and added Spark support) in this
>>>>>>>>>>>>>> PR https://github.com/apache/iceberg/pull/9830. I hope it helps
>>>>>>>>>>>>>> us reach convergence faster. More details about some of the
>>>>>>>>>>>>>> design options are discussed in the description of the PR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>>>> combined through a commit process. For instance, keeping a
>>>>>>>>>>>>>>> pointer to a table metadata file in a view metadata file or
>>>>>>>>>>>>>>> combining commits to reference both. I don't see the value in
>>>>>>>>>>>>>>> either option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root
>>>>>>>>>>>>>>>> question! Just a clarification question regarding your reply
>>>>>>>>>>>>>>>> before I reply further: what exactly does the option "a
>>>>>>>>>>>>>>>> combination of the two (i.e. commits are combined)" mean? How
>>>>>>>>>>>>>>>> is that different from "a new metadata type"?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can
>>>>>>>>>>>>>>>>> bring a fresh perspective.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jack already pointed out that we need to start from the
>>>>>>>>>>>>>>>>> basics and I agree with that. Let’s remove voting at this
>>>>>>>>>>>>>>>>> point. Right now is the time for discussing trade-offs, not
>>>>>>>>>>>>>>>>> lining up and taking sides. I realize that wasn’t the intent
>>>>>>>>>>>>>>>>> with adding a vote, but that’s almost always the result. It’s
>>>>>>>>>>>>>>>>> too easy to use it as a stand-in for consensus and move on
>>>>>>>>>>>>>>>>> prematurely. I get the impression from the swirl in Slack
>>>>>>>>>>>>>>>>> that discussion has moved ahead of agreement.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We’re still at the most basic question: is a materialized
>>>>>>>>>>>>>>>>> view a view and a separate table, a combination of the two
>>>>>>>>>>>>>>>>> (i.e. commits are combined), or a new metadata type?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some
>>>>>>>>>>>>>>>>> kind of “system table” (meaning hidden?) or if it is exposed
>>>>>>>>>>>>>>>>> in the catalog. That’s a later choice (already pointed out)
>>>>>>>>>>>>>>>>> and, I suspect, it should be delegated to catalog
>>>>>>>>>>>>>>>>> implementations.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To simplify this a little, I think that we can eliminate
>>>>>>>>>>>>>>>>> the option to combine table and view commits. I don’t think
>>>>>>>>>>>>>>>>> there is a reason to combine the two. If separate, a table
>>>>>>>>>>>>>>>>> would track the view version used along with freshness
>>>>>>>>>>>>>>>>> information for referenced tables. If the table is
>>>>>>>>>>>>>>>>> automatically skipped when the version no longer matches the
>>>>>>>>>>>>>>>>> view, then no action needs to happen when a view definition
>>>>>>>>>>>>>>>>> changes. Similarly, the table can be updated independentl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
