Re: Materialized view integration with REST spec

Szehon Ho Fri, 22 Mar 2024 10:50:04 -0700

Sounds good to me, can you start a document then, and we can all contribute
there?


On Fri, Mar 22, 2024 at 10:47 AM Walaa Eldin Moustafa <[email protected]>
wrote:

> Let us list the pros and cons as originally planned. I can help as well if
> needed. We can get started and have Jack chime in when he is back?
>
> On Fri, Mar 22, 2024 at 10:35 AM Szehon Ho <[email protected]>
> wrote:
>
>> Hi
>>
>> My understanding was that last time it was still unresolved, and the action
>> item was on Jack and/or Jan to make a shorter document.  I think the
>> debate now has boiled down to Ryan's three options:
>>
>>    1. separate table/view
>>    2. combination of table/view tied together via commit
>>    3. new metadata type
>>
>>  with probably the first and third being the main contenders. My
>> understanding was we wanted a table of pros/cons between (1) and (3),
>> presumably giving folks a chance to address the cons, before the next
>> meeting.
>>
>> Jack (main proponent of option (3)) just went on paternity leave, so I am
>> not sure if there is someone from Amazon with some context on Jack's
>> thinking to continue that train of thought?  Otherwise maybe Jan can give
>> it a shot?  Else I will be out and can't make the next Iceberg sync, but
>> can prepare one for the one after that, if needed.
>>
>> Re: the 'new' proposal, not sure if we are ready for a formal one, given
>> the deadlock between the two options, but I'm open to that as well, to
>> make a proposal based on one of the options above.  What do folks think?
>>
>> Thanks,
>> Szehon
>>
>> On Fri, Mar 22, 2024 at 3:15 AM Renjie Liu <[email protected]>
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Mar 22, 2024 at 16:42 Jean-Baptiste Onofré <[email protected]>
>>> wrote:
>>>
>>>> Hi Renjie,
>>>>
>>>> We discussed the MV proposal, without yet reaching any conclusion.
>>>>
>>>> I propose:
>>>> - to use the "new" proposal process in place (creating a GH issue with
>>>> proposal flag, with link to the document)
>>>> - use the document and/or GH issue to add comments
>>>> - finalize the document heading to a vote (to get consensus)
>>>>
>>>> Thoughts ?
>>>>
>>>> NB: I will follow up with "stale PR/proposal" PR to be sure we are
>>>> moving forward ;)
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi:
>>>>>
>>>>> Sorry I didn't make it to join the last community sync. Did we reach
>>>>> any conclusion about mv spec?
>>>>>
>>>>> On Tue, Mar 5, 2024 at 11:28 PM himadri pal <[email protected]> wrote:
>>>>>
>>>>>> For me the calendar link did not work on mobile, but I was able to
>>>>>> add the dev Google calendar from
>>>>>> https://iceberg.apache.org/community/#iceberg-community-events by
>>>>>> accessing it from a laptop.
>>>>>>
>>>>>> Regards,
>>>>>> Himadri Pal
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks Jack! I think the images are stripped from the message, but
>>>>>>> they are there on the doc
>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>>> if someone wants to check them out (I have left some comments while
>>>>>>> there).
>>>>>>>
>>>>>>> Also I no longer see the community sync calendar
>>>>>>> https://iceberg.apache.org/community/#slack, so it is unclear when
>>>>>>> the meeting is (and we do not have the link).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Jan! +1 for everyone to take a look before the discussion,
>>>>>>>> and see if there are any missing options or major arguments.
>>>>>>>>
>>>>>>>> I have also added the images regarding all the options, it might be
>>>>>>>> easier to parse than the big sheet. I will also put it here for
>>>>>>>> people that do not have time to read through it:
>>>>>>>>
>>>>>>>>
>>>>>>>> *Option 1: Add storage table identifier in view metadata content*
>>>>>>>>
>>>>>>>> [image: MV option 1.png]
>>>>>>>> *Option 2: Add storage table metadata file pointer in view object*
>>>>>>>>
>>>>>>>> [image: MV option 2.png]
>>>>>>>> *Option 3: Add storage table metadata file pointer in view metadata
>>>>>>>> content*
>>>>>>>>
>>>>>>>> [image: MV option 3.png]
>>>>>>>>
>>>>>>>> *Option 4: Embed table metadata in view metadata content*
>>>>>>>>
>>>>>>>> [image: MV option 4.png]
>>>>>>>> *Option 5: New MV spec, MV object has table and view metadata file
>>>>>>>> pointers*
>>>>>>>>
>>>>>>>> [image: MV option 5.png]
>>>>>>>> *Option 6: New MV spec, MV metadata content embeds table and view
>>>>>>>> metadata*
>>>>>>>>
>>>>>>>> [image: MV option 6.png]
>>>>>>>> *Option 7: New MV spec, completely new MV metadata content*
>>>>>>>>
>>>>>>>> [image: MV option 7.png]
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think it's great to have a face to face discussion about this.
>>>>>>>>> Additionally, I would propose to use Jack's document
>>>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>>>>> as a common ground for the discussion and that everyone has a quick
>>>>>>>>> look before the next community sync. If you think the document is
>>>>>>>>> still missing some arguments, please make suggestions to add them.
>>>>>>>>> This way we have to spend less time to get everyone up to speed and
>>>>>>>>> have a more common terminology.
>>>>>>>>>
>>>>>>>>> Looking forward to the discussion, best wishes
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>>>>>>
>>>>>>>>> The calendar on the site is currently broken
>>>>>>>>> https://iceberg.apache.org/community/#iceberg-community-events.
>>>>>>>>> Might help to fix it or share the meeting link here.
>>>>>>>>>
>>>>>>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sounds good, let's discuss this in person!
>>>>>>>>>>
>>>>>>>>>> I am a bit worried that we have quite a few critical topics going
>>>>>>>>>> on right now on devlist, and this will take up a lot of time to
>>>>>>>>>> discuss. If it ends up going for too long, I propose we have a
>>>>>>>>>> dedicated meeting, and I am more than happy to organize it.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>
>>>>>>>>>>> I think this thread has hit a point of diminishing returns and
>>>>>>>>>>> that we still don't have a common understanding of what the
>>>>>>>>>>> options under consideration actually are.
>>>>>>>>>>>
>>>>>>>>>>> Since we were already planning on discussing this at the next
>>>>>>>>>>> community sync, I suggest we pick this up there and use that time
>>>>>>>>>>> to align on what exactly we're considering. We can then start a
>>>>>>>>>>> new thread to lay out the designs under consideration in more
>>>>>>>>>>> detail and then have a discussion about trade-offs.
>>>>>>>>>>>
>>>>>>>>>>> Does that sound reasonable?
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am finding it hard to interpret the options concretely. I
>>>>>>>>>>>> would also suggest breaking the expectation/outcome into
>>>>>>>>>>>> milestones. Maybe it becomes easier if we agree to distinguish
>>>>>>>>>>>> between an approach that is feasible in the near term and
>>>>>>>>>>>> another in the long term, especially if the latter requires
>>>>>>>>>>>> significant engine-side changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Further, maybe it helps if we start with an option that fully
>>>>>>>>>>>> reuses the existing spec, and see how we view it in comparison
>>>>>>>>>>>> with the options discussed previously. I am sharing one below.
>>>>>>>>>>>> It reuses the current spec of Iceberg views and tables by
>>>>>>>>>>>> leveraging table properties to capture materialized view
>>>>>>>>>>>> metadata. What is common (and not common) between this and the
>>>>>>>>>>>> desired representations?
>>>>>>>>>>>>
>>>>>>>>>>>> The new properties are:
>>>>>>>>>>>>
>>>>>>>>>>>> Properties on a View:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. *iceberg.materialized.view*:
>>>>>>>>>>>>       - *Type*: View property
>>>>>>>>>>>>       - *Purpose*: This property is used to mark whether a
>>>>>>>>>>>>       view is a materialized view. If set to true, the view is
>>>>>>>>>>>>       treated as a materialized view. This helps in
>>>>>>>>>>>>       differentiating between virtual and materialized views
>>>>>>>>>>>>       within the catalog and dictates specific handling and
>>>>>>>>>>>>       validation logic for materialized views.
>>>>>>>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>>>>>>>       - *Type*: View property
>>>>>>>>>>>>       - *Purpose*: Specifies the location of the storage table
>>>>>>>>>>>>       associated with the materialized view. This property is
>>>>>>>>>>>>       used for linking a materialized view with its
>>>>>>>>>>>>       corresponding storage table, enabling data management and
>>>>>>>>>>>>       query execution based on the stored data freshness.
>>>>>>>>>>>>
>>>>>>>>>>>> Properties on a Table:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>>>>>>>       - *Type*: Table property
>>>>>>>>>>>>       - *Purpose*: These properties store the snapshot IDs of
>>>>>>>>>>>>       the base tables at the time the materialized view's data
>>>>>>>>>>>>       was last updated. Each property is prefixed with
>>>>>>>>>>>>       base.snapshot. followed by the UUID of the base table.
>>>>>>>>>>>>       They are used to track whether the materialized view's
>>>>>>>>>>>>       data is up to date with the base tables by comparing these
>>>>>>>>>>>>       snapshot IDs with the current snapshot IDs of the base
>>>>>>>>>>>>       tables. If all the base tables' current snapshot IDs match
>>>>>>>>>>>>       the ones stored in these properties, the materialized
>>>>>>>>>>>>       view's data is considered fresh.
>>>>>>>>>>>>
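The freshness rule described for the base.snapshot.[UUID] properties can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the storage table's properties are taken to be a plain dict of strings, the current snapshot ID of each base table is assumed to have been looked up already by UUID, and the name `is_fresh` and both dict shapes are hypothetical rather than part of any Iceberg API.

```python
PREFIX = "base.snapshot."


def is_fresh(storage_table_props, current_snapshot_ids):
    """Return True if every recorded base-table snapshot ID matches the current one.

    storage_table_props: dict of table property name -> string value (hypothetical shape).
    current_snapshot_ids: dict of base table UUID -> current snapshot ID as int
    (hypothetical shape; how these are fetched is out of scope here).
    """
    # Collect the snapshot IDs recorded at the last refresh, keyed by base table UUID.
    recorded = {
        key[len(PREFIX):]: int(value)
        for key, value in storage_table_props.items()
        if key.startswith(PREFIX)
    }
    # Assumption: with no recorded base snapshots, the data is treated as stale.
    if not recorded:
        return False
    # Fresh only if all base tables are still at the recorded snapshot.
    return all(
        current_snapshot_ids.get(uuid) == snapshot_id
        for uuid, snapshot_id in recorded.items()
    )


props = {
    "base.snapshot.uuid-a": "101",
    "base.snapshot.uuid-b": "202",
    "some.other.property": "x",
}
print(is_fresh(props, {"uuid-a": 101, "uuid-b": 202}))
print(is_fresh(props, {"uuid-a": 101, "uuid-b": 203}))
```

The second call differs only in the current snapshot of one base table, which is enough to mark the materialized data as stale under this rule.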
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > All of these approaches are aligned in one, specific way:
>>>>>>>>>>>>> the storage table is an iceberg table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I do not think that is true. I think people are aligned that
>>>>>>>>>>>>> we would like to re-use the Iceberg table metadata defined in
>>>>>>>>>>>>> the Iceberg table spec to express the data in MV, but I don't
>>>>>>>>>>>>> think it goes so far as to say it must be an Iceberg table.
>>>>>>>>>>>>> Once you have that mindset, then of course option 1 (separate
>>>>>>>>>>>>> table and view) is the only option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > I don't think that is necessary and it
>>>>>>>>>>>>> significantly increases the complexity.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And can you quantify what you mean by "significantly increases
>>>>>>>>>>>>> the complexity"? Seems like a lot of concerns are coming from
>>>>>>>>>>>>> the tradeoff with complexity. We probably all agree that using
>>>>>>>>>>>>> option 7 (a completely new metadata type) is a lot of work from
>>>>>>>>>>>>> scratch, that is why it is not favored. However, my
>>>>>>>>>>>>> understanding is that as long as we re-use the view and table
>>>>>>>>>>>>> metadata, then the majority of the existing logic can be
>>>>>>>>>>>>> reused. I think what we have gone through in Slack to draft the
>>>>>>>>>>>>> rough Java API shape helps here, because people can estimate
>>>>>>>>>>>>> the amount of effort required to implement it. And I don't
>>>>>>>>>>>>> think they are **significantly** more complex to implement.
>>>>>>>>>>>>> Could you elaborate more about the complexity that you imagine?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I feel I've been most vocal about pushing back against
>>>>>>>>>>>>>> options 2+ (or Ryan's categories of combined table/view, or
>>>>>>>>>>>>>> new metadata type), so I'll try to expand on my reasoning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand the appeal of creating a design where we
>>>>>>>>>>>>>> encapsulate the view/storage from both a structural and
>>>>>>>>>>>>>> performance standpoint, but I don't think that is necessary
>>>>>>>>>>>>>> and it significantly increases the complexity.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All of these approaches are aligned in one, specific way: the
>>>>>>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Because of this, all the behaviors and requirements
>>>>>>>>>>>>>> still apply to these tables.  They need to be maintained
>>>>>>>>>>>>>> (snapshot cleanup, orphan files), in some cases need to be
>>>>>>>>>>>>>> optimized (compaction, manifest rewrites), and they need to be
>>>>>>>>>>>>>> able to be inspected (this will be even more important with MV
>>>>>>>>>>>>>> since staleness can produce different results and questions
>>>>>>>>>>>>>> will arise about what state the storage table was in).  There
>>>>>>>>>>>>>> may be cases where the tables need to be managed directly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anywhere we deviate from the existing
>>>>>>>>>>>>>> constructs/commit/access for tables, we will ultimately have
>>>>>>>>>>>>>> to then unwrap to re-expose the underlying Iceberg behavior.
>>>>>>>>>>>>>> This creates unnecessary complexity in the library/API layer,
>>>>>>>>>>>>>> which is not the primary interface users will have with
>>>>>>>>>>>>>> materialized views, where an engine is almost entirely
>>>>>>>>>>>>>> necessary to interact with the dataset.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As to the performance concerns around option 1, I think we're
>>>>>>>>>>>>>> overstating the downsides.  It really comes down to how many
>>>>>>>>>>>>>> metadata loads are necessary and evaluating freshness would
>>>>>>>>>>>>>> likely be the real bottleneck as it involves potentially
>>>>>>>>>>>>>> loading many tables.  All of the options are on the same order
>>>>>>>>>>>>>> of performance for the metadata and table loads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As to the visibility of tables and whether they're registered
>>>>>>>>>>>>>> in the catalog, I think registering in the catalog is the
>>>>>>>>>>>>>> right approach so that the tables are still addressable for
>>>>>>>>>>>>>> maintenance/etc.  The visibility of the storage table is a
>>>>>>>>>>>>>> catalog implementation decision and shouldn't be a requirement
>>>>>>>>>>>>>> of the MV spec (I can see cases for both and it isn't
>>>>>>>>>>>>>> necessary to dictate a behavior).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm still strongly in favor of Option 1 (separate table and
>>>>>>>>>>>>>> view) for these reasons.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Dan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined
>>>>>>>>>>>>>>> table and view (rather than a new metadata spec for a
>>>>>>>>>>>>>>> materialized view). What is the main motivation? It seems
>>>>>>>>>>>>>>> like you’re convinced of that approach, but I don’t
>>>>>>>>>>>>>>> understand the advantage it brings.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the
>>>>>>>>>>>>>>> options we have discussed so far. I wanted to use the
>>>>>>>>>>>>>>> existing Google Doc, but it has really bad table/sheet
>>>>>>>>>>>>>>> support...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have listed all the options, with how they are implemented
>>>>>>>>>>>>>>> and some important considerations we have discussed so far.
>>>>>>>>>>>>>>> Note that:
>>>>>>>>>>>>>>> 1. This sheet currently excludes the lineage information,
>>>>>>>>>>>>>>> which we can discuss more later after the current topic is
>>>>>>>>>>>>>>> resolved.
>>>>>>>>>>>>>>> 2. I removed the considerations for REST integration since
>>>>>>>>>>>>>>> from the other thread we have clarified that they should be
>>>>>>>>>>>>>>> considered completely separately.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Why I come as a proponent of having a new MV object with
>>>>>>>>>>>>>>> table and view metadata file pointer*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In my sheet, there are 3 options that do not have major
>>>>>>>>>>>>>>> problems:
>>>>>>>>>>>>>>> Option 2: Add storage table metadata file pointer in view
>>>>>>>>>>>>>>> object
>>>>>>>>>>>>>>> Option 5: New MV object with table and view metadata file
>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I originally excluded option 2 because I thought it did not
>>>>>>>>>>>>>>> align with the REST spec, but after the other discussion
>>>>>>>>>>>>>>> thread about "Inconsistency between REST spec and table/view
>>>>>>>>>>>>>>> spec", I think my original concern no longer holds true, so
>>>>>>>>>>>>>>> now I put it back. And based on my personal preference that
>>>>>>>>>>>>>>> MV is an independent object that should be separated from
>>>>>>>>>>>>>>> view and table, plus the fact that option 5 is probably less
>>>>>>>>>>>>>>> work than option 6 for implementation, that is how I became a
>>>>>>>>>>>>>>> proponent of option 5 at this moment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's
>>>>>>>>>>>>>>> evaluation framework. That framework categorization puts
>>>>>>>>>>>>>>> options 2, 3, 4, 5, 6 all under the same category of "A
>>>>>>>>>>>>>>> combination of a view and a table" and concludes that they
>>>>>>>>>>>>>>> don't have any advantage for the same set of reasons. But
>>>>>>>>>>>>>>> those reasons are not really convincing to me so let's talk
>>>>>>>>>>>>>>> about them in more detail.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view and
>>>>>>>>>>>>>>> table is advantageous" as "this would cause unnecessary
>>>>>>>>>>>>>>> dependence between the view and table in catalogs."  What
>>>>>>>>>>>>>>> dependency exactly do you mean here? And why is that
>>>>>>>>>>>>>>> unnecessary, given there has to be some sort of dependency
>>>>>>>>>>>>>>> anyway unless we go with option 5 or 6?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (2) You said "I guess there’s an argument that you could
>>>>>>>>>>>>>>> load both table and view metadata locations at the same time.
>>>>>>>>>>>>>>> That hardly seems worth the trouble". I disagree with that.
>>>>>>>>>>>>>>> Catalog interaction performance is critical to at least
>>>>>>>>>>>>>>> everyone working in EMR and Athena, and MV itself as an
>>>>>>>>>>>>>>> acceleration approach needs to be as fast as possible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have put 3 key operations in the doc that I think matter
>>>>>>>>>>>>>>> for MV during interactions with the engine:
>>>>>>>>>>>>>>> 1. refresh the storage table
>>>>>>>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And option 1 clearly falls short with 4 sequential steps
>>>>>>>>>>>>>>> required to load a storage table. You mentioned "recent
>>>>>>>>>>>>>>> issues with adding views to the JDBC catalog" in this topic,
>>>>>>>>>>>>>>> could you explain a bit more?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (3) You said "I also think that once we decide on structure,
>>>>>>>>>>>>>>> we can make it possible for REST catalog implementations to
>>>>>>>>>>>>>>> do smart things, in a way that doesn’t put additional
>>>>>>>>>>>>>>> requirements on the underlying catalog store." If REST is
>>>>>>>>>>>>>>> fully compatible with Iceberg spec then I have no problem
>>>>>>>>>>>>>>> with this statement. However, as we discussed in the other
>>>>>>>>>>>>>>> thread, it is not the case. In the current state, I think the
>>>>>>>>>>>>>>> sequence of action should be to evolve the Iceberg table/view
>>>>>>>>>>>>>>> spec (or add a MV spec) first, and then think about how REST
>>>>>>>>>>>>>>> can incorporate it or do smart things that are not Iceberg
>>>>>>>>>>>>>>> spec compliant. Do you agree with that?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (4) You said the table identifier pointer "is a problem we
>>>>>>>>>>>>>>> need to solve generally because a materialized table needs to
>>>>>>>>>>>>>>> be able to track the upstream state of tables that were
>>>>>>>>>>>>>>> used". I don't think that is a reason to choose to use a
>>>>>>>>>>>>>>> table identifier pointer for a storage table. The issue is
>>>>>>>>>>>>>>> not about using a table identifier pointer. It is about
>>>>>>>>>>>>>>> exposing the storage table as a separate entity in the
>>>>>>>>>>>>>>> catalog, which is what people do not like and is already
>>>>>>>>>>>>>>> discussed at length in Jan's question 3 (also linked in the
>>>>>>>>>>>>>>> sheet). I agree with that statement, because without a REST
>>>>>>>>>>>>>>> implementation that can magically hide the storage table,
>>>>>>>>>>>>>>> this model adds additional burden regarding compliance and
>>>>>>>>>>>>>>> data governance for any other non-REST catalog
>>>>>>>>>>>>>>> implementations that are compliant with the Iceberg spec.
>>>>>>>>>>>>>>> Many mechanisms need to be built in a catalog to hide,
>>>>>>>>>>>>>>> protect, maintain, and recycle the storage table, which can
>>>>>>>>>>>>>>> be avoided by using other approaches. I think we should reach
>>>>>>>>>>>>>>> a consensus about that and discuss further if you do not
>>>>>>>>>>>>>>> agree.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ryan, we actually discussed your categories in this
>>>>>>>>>>>>>>>> question
>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>>>>>>>>>>>>> where your categories correspond to the following designs:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3
>>>>>>>>>>>>>>>> categories, so I’ll be more specific:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - *Separate table and view*: this option is to have the
>>>>>>>>>>>>>>>>    objects that we have today, with extra metadata. Commit
>>>>>>>>>>>>>>>>    processes are separate: committing to the table doesn’t
>>>>>>>>>>>>>>>>    alter the view and committing to the view doesn’t change
>>>>>>>>>>>>>>>>    the table. However, changing the view can make it so the
>>>>>>>>>>>>>>>>    table is no longer useful as a materialization.
>>>>>>>>>>>>>>>>    - *A combination of a view and a table*: in this
>>>>>>>>>>>>>>>>    option, the table metadata and view metadata are the same
>>>>>>>>>>>>>>>>    as the first option. The difference is that the commit
>>>>>>>>>>>>>>>>    process combines them, either by embedding a table
>>>>>>>>>>>>>>>>    metadata location in view metadata or by tracking both in
>>>>>>>>>>>>>>>>    the same catalog reference.
>>>>>>>>>>>>>>>>    - *A new metadata type*: this option is where we define
>>>>>>>>>>>>>>>>    a new metadata object that has view attributes, like SQL
>>>>>>>>>>>>>>>>    representations, along with table attributes, like
>>>>>>>>>>>>>>>>    partition specs and snapshots.
>>>>>>>>>>>>>>>>
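One way to picture the three categories above is as catalog data shapes. The following Python sketch is illustrative only: `CatalogEntry`, `CombinedEntry`, `MaterializedViewMetadata`, `commit_separate`, and every field name are hypothetical and do not come from the Iceberg spec, and the combined option is shown only in its "same catalog reference" variant.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Category 1: separate table and view. Two independent catalog entries,
# each committed on its own by swapping its metadata location.
@dataclass
class CatalogEntry:
    identifier: str
    metadata_location: str


# Category 2: a combination of a view and a table. The metadata files are
# unchanged, but one catalog reference tracks both locations, so the
# commit process ties them together.
@dataclass
class CombinedEntry:
    identifier: str
    view_metadata_location: str
    table_metadata_location: str


# Category 3: a new metadata type. A single metadata object carries view
# attributes (SQL representations) and table attributes (specs, snapshots).
@dataclass
class MaterializedViewMetadata:
    sql_representations: List[str]
    partition_specs: List[dict] = field(default_factory=list)
    snapshots: List[dict] = field(default_factory=list)


def commit_separate(catalog: Dict[str, CatalogEntry], identifier: str,
                    new_location: str) -> None:
    """Category 1 commit: updates exactly one entry and leaves the rest alone."""
    catalog[identifier].metadata_location = new_location


catalog = {
    "db.mv": CatalogEntry("db.mv", "s3://bucket/mv/view-v1.json"),
    "db.mv_storage": CatalogEntry("db.mv_storage", "s3://bucket/mv/table-v1.json"),
}
# Committing to the storage table does not alter the view entry.
commit_separate(catalog, "db.mv_storage", "s3://bucket/mv/table-v2.json")
print(catalog["db.mv"].metadata_location)
print(catalog["db.mv_storage"].metadata_location)
```

In the first shape the two commits are independent, which is the property the "separate table and view" description emphasizes; the other two shapes differ in where the coupling lives (catalog reference versus metadata content).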
>>>>>>>>>>>>>>>> Hopefully this is clear because I think much of the
>>>>>>>>>>>>>>>> confusion is caused by different definitions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The LoadTableResponse having optional metadata-location
>>>>>>>>>>>>>>>> field implies that the object in the catalog no longer needs
>>>>>>>>>>>>>>>> to hold a metadata file pointer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The REST protocol has not removed the requirement for a
>>>>>>>>>>>>>>>> metadata file, so I’m going to keep focused on the MV design
>>>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When we say a MV can be a “new metadata type”, it does not
>>>>>>>>>>>>>>>> mean it needs to define a completely brand new structure of
>>>>>>>>>>>>>>>> the metadata content
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’m making a distinction between separate metadata files
>>>>>>>>>>>>>>>> for the table and the view and a combined metadata object,
>>>>>>>>>>>>>>>> as above.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We can define an “Iceberg MV” to be an object in a catalog,
>>>>>>>>>>>>>>>> which has 1 table metadata file pointer, and 1 view metadata
>>>>>>>>>>>>>>>> file pointer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is the option I am referring to as a “combination of a
>>>>>>>>>>>>>>>> view and a table”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So to review my initial email, I don’t see a reason why a
>>>>>>>>>>>>>>>> combined view and table is advantageous, either implemented
>>>>>>>>>>>>>>>> by having a catalog reference with two metadata locations or
>>>>>>>>>>>>>>>> embedding a table metadata location in view metadata. This
>>>>>>>>>>>>>>>> would cause unnecessary dependence between the view and
>>>>>>>>>>>>>>>> table in catalogs. I guess there’s an argument that you
>>>>>>>>>>>>>>>> could load both table and view metadata locations at the
>>>>>>>>>>>>>>>> same time. That hardly seems worth the trouble given the
>>>>>>>>>>>>>>>> recent issues with adding views to the JDBC catalog.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also think that once we decide on structure, we can make
>>>>>>>>>>>>>>>> it possible for REST catalog implementations to do smart
>>>>>>>>>>>>>>>> things, in a way that doesn’t put additional requirements on
>>>>>>>>>>>>>>>> the underlying catalog store. For instance, we could specify
>>>>>>>>>>>>>>>> how to send additional objects in a LoadViewResult, in case
>>>>>>>>>>>>>>>> the catalog wants to pre-fetch table metadata. I think these
>>>>>>>>>>>>>>>> optimizations are a later addition, after we define the
>>>>>>>>>>>>>>>> relationship between views and tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined
>>>>>>>>>>>>>>>> table and view (rather than a new metadata spec for a
>>>>>>>>>>>>>>>> materialized view). What is the main motivation? It seems
>>>>>>>>>>>>>>>> like you’re convinced of that approach, but I don’t
>>>>>>>>>>>>>>>> understand the advantage it brings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few
>>>>>>>>>>>>>>>>> minor points.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a
>>>>>>>>>>>>>>>>>> new metadata type?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial
>>>>>>>>>>>>>>>>> proposal of a new Catalog MV object that has two references
>>>>>>>>>>>>>>>>> (ViewMetadata + TableMetadata).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - Regular views are separate, rather than being
>>>>>>>>>>>>>>>>>>    tables with SQL and no data so it would be inconsistent
>>>>>>>>>>>>>>>>>>    (“Iceberg view is just a table with no data but with
>>>>>>>>>>>>>>>>>>    representations defined. But we did not do that.”)
>>>>>>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>>>>>>    - Tables may be a superset of functionality needed
>>>>>>>>>>>>>>>>>>    for materialized views
>>>>>>>>>>>>>>>>>>    - Tables are not typically exposed to end users, but
>>>>>>>>>>>>>>>>>>    this isn’t required by the separate view and table option
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack
>>>>>>>>>>>>>>>>>    says it is a spec change (i.e., to catalogs)
>>>>>>>>>>>>>>>>>    - A single call to get the View's StorageTable (versus
>>>>>>>>>>>>>>>>>    two calls)
>>>>>>>>>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Thoughts:* I think the long discussion sessions we had
>>>>>>>>>>>>>>>>> on Slack were fruitful for me, as seeing the API clarified
>>>>>>>>>>>>>>>>> some things.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was initially more in favor of MV being a new metadata
>>>>>>>>>>>>>>>>> type (TableMetadata + ViewMetadata).  But seeing most of
>>>>>>>>>>>>>>>>> the MV operations end up being ViewCatalog or Catalog
>>>>>>>>>>>>>>>>> operations, I am starting to think API-wise that it may not
>>>>>>>>>>>>>>>>> align with the new metadata type (unless we define
>>>>>>>>>>>>>>>>> MVCatalog and /MV REST endpoints, which then are
>>>>>>>>>>>>>>>>> boilerplate wrappers).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Initially one question I had for option 'a view and a
>>>>>>>>>>>>>>>>> separate table', was how to make this table reference 
>>>>>>>>>>>>>>>>> (metadata.json or
>>>>>>>>>>>>>>>>> catalog reference).  In the previous option, we had a 
>>>>>>>>>>>>>>>>> precedent of Catalog
>>>>>>>>>>>>>>>>> references to Metadata, but not pointers between Metadatas.  
>>>>>>>>>>>>>>>>> I initially
>>>>>>>>>>>>>>>>> saw the proposed Catalog's TableIdentifier pointer as 
>>>>>>>>>>>>>>>>> 'polluting' catalog
>>>>>>>>>>>>>>>>> concerns in ViewMetadata.  (I saw Catalog and ViewCatalog as 
>>>>>>>>>>>>>>>>> a layer above
>>>>>>>>>>>>>>>>> TableMetadata and ViewMetadata).  But I think Dan in the 
>>>>>>>>>>>>>>>>> Slack made a fair
>>>>>>>>>>>>>>>>> point that ViewMetadata already is tightly bound with a 
>>>>>>>>>>>>>>>>> Catalog.  In this
>>>>>>>>>>>>>>>>> case, I think this approach does have its merits as well in 
>>>>>>>>>>>>>>>>> aligning
>>>>>>>>>>>>>>>>> Catalog APIs with the metadata.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to provide my perspective on the question of
>>>>>>>>>>>>>>>>>> what a materialized view is and elaborate on Jack's recent 
>>>>>>>>>>>>>>>>>> proposal to view
>>>>>>>>>>>>>>>>>> a materialized view as a catalog concept.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every
>>>>>>>>>>>>>>>>>> entity in the catalog has a *unique identifier*, and the
>>>>>>>>>>>>>>>>>> catalog provides methods to create, load, and update these 
>>>>>>>>>>>>>>>>>> entities. An
>>>>>>>>>>>>>>>>>> important thing to note is that the catalog methods exhibit 
>>>>>>>>>>>>>>>>>> two different
>>>>>>>>>>>>>>>>>> behaviors: the *create and load methods deal with the
>>>>>>>>>>>>>>>>>> entire entity*, while the *update(commit) method only
>>>>>>>>>>>>>>>>>> deals with partial changes* to the entities.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In the context of our current discussion, materialized
>>>>>>>>>>>>>>>>>> view (MV) metadata is a union of view and table metadata. 
>>>>>>>>>>>>>>>>>> The fact that the
>>>>>>>>>>>>>>>>>> update method deals only with partial changes enables us to 
>>>>>>>>>>>>>>>>>> *reuse
>>>>>>>>>>>>>>>>>> the existing methods for updating tables and views*. For
>>>>>>>>>>>>>>>>>> updates we don't have to define what constitutes an entire 
>>>>>>>>>>>>>>>>>> materialized
>>>>>>>>>>>>>>>>>> view. Changes to a materialized view targeting the 
>>>>>>>>>>>>>>>>>> properties related to
>>>>>>>>>>>>>>>>>> the view metadata could use the update(commit) view method. 
>>>>>>>>>>>>>>>>>> Similarly,
>>>>>>>>>>>>>>>>>> changes targeting the properties related to the table 
>>>>>>>>>>>>>>>>>> metadata could use
>>>>>>>>>>>>>>>>>> the update(commit) table method. This is great news because 
>>>>>>>>>>>>>>>>>> we don't have
>>>>>>>>>>>>>>>>>> to redefine view and table commits (requirements, updates).
>>>>>>>>>>>>>>>>>> This is shown in the fact that Jack uses the same
>>>>>>>>>>>>>>>>>> operation to update the storage table for Option 1 and 3:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // REST: POST
>>>>>>>>>>>>>>>>>> /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The open question is *whether the create and load
>>>>>>>>>>>>>>>>>> methods should treat the properties that constitute the MV 
>>>>>>>>>>>>>>>>>> metadata as two
>>>>>>>>>>>>>>>>>> entities (View + Table) or one entity (new MV object)*.
>>>>>>>>>>>>>>>>>> This is all part of Jack's proposal, where Option 1 proposes 
>>>>>>>>>>>>>>>>>> a new MV
>>>>>>>>>>>>>>>>>> object, and Option 3 proposes two separate entities. The 
>>>>>>>>>>>>>>>>>> advantage of
>>>>>>>>>>>>>>>>>> Option 1 is that it doesn't require two operations to load 
>>>>>>>>>>>>>>>>>> the metadata. On
>>>>>>>>>>>>>>>>>> the other hand, the advantage of Option 3 is that no new 
>>>>>>>>>>>>>>>>>> operations or
>>>>>>>>>>>>>>>>>> catalogs have to be defined.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my opinion, defining a new representation for
>>>>>>>>>>>>>>>>>> materialized views (Option 1) is generally the cleaner 
>>>>>>>>>>>>>>>>>> solution. However, I
>>>>>>>>>>>>>>>>>> see a path where we could first introduce Option 3 and still 
>>>>>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>> possibility to transition to Option 1 if needed. The great 
>>>>>>>>>>>>>>>>>> thing about
>>>>>>>>>>>>>>>>>> Option 3 is that it only requires minor changes to the 
>>>>>>>>>>>>>>>>>> current spec and is
>>>>>>>>>>>>>>>>>> mostly implementation detail.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option
>>>>>>>>>>>>>>>>>> 3 that only introduce changes to the spec that are not 
>>>>>>>>>>>>>>>>>> specific to
>>>>>>>>>>>>>>>>>> materialized views. The idea is to introduce boolean 
>>>>>>>>>>>>>>>>>> properties to be set
>>>>>>>>>>>>>>>>>> on the creation of the view and the storage table that 
>>>>>>>>>>>>>>>>>> indicate that they
>>>>>>>>>>>>>>>>>> belong to a materialized view. The view property 
>>>>>>>>>>>>>>>>>> "materialized" is set to
>>>>>>>>>>>>>>>>>> "true" for a MV and "false" for a regular view. And the 
>>>>>>>>>>>>>>>>>> table property
>>>>>>>>>>>>>>>>>> "storage_table" is set to "true" for a storage table and 
>>>>>>>>>>>>>>>>>> "false" for a
>>>>>>>>>>>>>>>>>> regular table. The absence of these properties indicates a 
>>>>>>>>>>>>>>>>>> regular view or
>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1",
>>>>>>>>>>>>>>>>>> "mv1"));
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if
>>>>>>>>>>>>>>>>>> present
>>>>>>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We could then introduce a new requirement for views and
>>>>>>>>>>>>>>>>>> tables called "AssertProperty" which could make sure to only 
>>>>>>>>>>>>>>>>>> perform
>>>>>>>>>>>>>>>>>> updates that are in line with materialized views. The 
>>>>>>>>>>>>>>>>>> additional requirement
>>>>>>>>>>>>>>>>>> can be seen as a general extension which does not need to be 
>>>>>>>>>>>>>>>>>> changed if we
>>>>>>>>>>>>>>>>>> decide to go with Option 1 in the future.
>>>>>>>>>>>>>>>>>>
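For illustration only, the flag semantics and the proposed "AssertProperty" requirement described above could be sketched roughly as follows. This is a minimal sketch, not part of Iceberg: the class name AssertPropertySketch and the methods flagSet and assertProperty are hypothetical, properties are modelled as a plain string map, and only the keys "materialized" and "storage_table" come from the message above. A missing key is treated as "false", matching the statement that absence indicates a regular view or table.

```java
import java.util.Map;

public class AssertPropertySketch {
    // Property keys as named in the proposal text.
    public static final String VIEW_FLAG = "materialized";
    public static final String TABLE_FLAG = "storage_table";

    /** Returns true only if the flag is present and "true"; absence means a regular view or table. */
    public static boolean flagSet(Map<String, String> properties, String key) {
        return Boolean.parseBoolean(properties.getOrDefault(key, "false"));
    }

    /**
     * Hypothetical "AssertProperty" requirement: rejects an update unless the
     * property currently has the expected value (a missing key never matches).
     */
    public static void assertProperty(Map<String, String> properties, String key, String expected) {
        String actual = properties.get(key);
        if (!expected.equals(actual)) {
            throw new IllegalStateException(
                "Requirement failed: property '" + key + "' expected '" + expected
                    + "' but was '" + actual + "'");
        }
    }

    public static void main(String[] args) {
        Map<String, String> viewProps = Map.of(VIEW_FLAG, "true");
        Map<String, String> storageProps = Map.of(TABLE_FLAG, "true");
        Map<String, String> regularTableProps = Map.of();

        System.out.println("view is materialized: " + flagSet(viewProps, VIEW_FLAG));
        System.out.println("regular table is storage table: " + flagSet(regularTableProps, TABLE_FLAG));

        // An MV refresh would only be committed against a table marked as a storage table.
        assertProperty(storageProps, TABLE_FLAG, "true");

        try {
            assertProperty(regularTableProps, TABLE_FLAG, "true");
        } catch (IllegalStateException e) {
            System.out.println("update rejected: " + e.getMessage());
        }
    }
}
```

Because the check is expressed over generic key/value properties rather than anything specific to materialized views, the same requirement could in principle stay unchanged under either option.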
>>>>>>>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing
>>>>>>>>>>>>>>>>>> existing metadata definitions and minimizing spec changes 
>>>>>>>>>>>>>>>>>> are very
>>>>>>>>>>>>>>>>>> important. This also minimizes spec drift (between 
>>>>>>>>>>>>>>>>>> materialized views and
>>>>>>>>>>>>>>>>>> views spec, and between materialized views and tables spec), 
>>>>>>>>>>>>>>>>>> and simplifies
>>>>>>>>>>>>>>>>>> the implementation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In an effort to take the discussion forward with concrete
>>>>>>>>>>>>>>>>>> design options based on an end-to-end implementation, I have 
>>>>>>>>>>>>>>>>>> prototyped the
>>>>>>>>>>>>>>>>>> implementation (and added Spark support) in this PR
>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it
>>>>>>>>>>>>>>>>>> helps us reach convergence faster. More details about some 
>>>>>>>>>>>>>>>>>> of the design
>>>>>>>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>>>>>>>> combined through a commit process. For instance, keeping a 
>>>>>>>>>>>>>>>>>>> pointer to a
>>>>>>>>>>>>>>>>>>> table metadata file in a view metadata file or combining 
>>>>>>>>>>>>>>>>>>> commits to
>>>>>>>>>>>>>>>>>>> reference both. I don't see the value in either option.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root
>>>>>>>>>>>>>>>>>>>> question! Just a clarification question regarding your 
>>>>>>>>>>>>>>>>>>>> reply before I reply
>>>>>>>>>>>>>>>>>>>> further: what exactly does the option "a combination of 
>>>>>>>>>>>>>>>>>>>> the two (i.e.
>>>>>>>>>>>>>>>>>>>> commits are combined)" mean? How is that different from "a 
>>>>>>>>>>>>>>>>>>>> new metadata
>>>>>>>>>>>>>>>>>>>> type"?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
