Re: Materialized view integration with REST spec

Szehon Ho Fri, 22 Mar 2024 10:35:27 -0700

Hi

My understanding is that it was still unresolved last time, and the action item
was on Jack and/or Jan to make a shorter document.  I think the debate has now
boiled down to Ryan's three options:


   1. separate table/view
   2. combination of table/view tied together via commit
   3. new metadata type

 with probably the first and third being the main contenders. My
understanding was we wanted a table of pros/cons between (1) and (3),
presumably giving folks a chance to address the cons, before the next
meeting.

Jack (the main proponent of option (3)) just went on paternity leave, so I am not
sure whether someone from Amazon has enough context on Jack's thinking
to continue that line of argument.  Otherwise maybe Jan can give it
a shot?  Failing that, I will be out and can't make the next Iceberg sync, but I can
prepare one for the sync after that, if needed.

Re: the 'new' proposal process, I'm not sure we are ready for a formal one, given the
deadlock between the two options, but I'm open to making a
proposal based on one of the options above.  What do folks think?

Thanks,
Szehon

On Fri, Mar 22, 2024 at 3:15 AM Renjie Liu <[email protected]> wrote:

> +1
>
> On Fri, Mar 22, 2024 at 16:42 Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi Renjie,
>>
>> We discussed the MV proposal, without yet reaching any conclusion.
>>
>> I propose:
>> - to use the "new" proposal process in place (creating a GH issue with
>> proposal flag, with link to the document)
>> - use the document and/or GH issue to add comments
>> - finalize the document heading to a vote (to get consensus)
>>
>> Thoughts ?
>>
>> NB: I will follow up with "stale PR/proposal" PR to be sure we are moving
>> forward ;)
>>
>> Regards
>> JB
>>
>> On Fri, Mar 22, 2024 at 4:29 AM Renjie Liu <[email protected]>
>> wrote:
>>
>>> Hi:
>>>
>>> Sorry I couldn't join the last community sync. Did we reach any
>>> conclusion about the MV spec?
>>>
>>> On Tue, Mar 5, 2024 at 11:28 PM himadri pal <[email protected]> wrote:
>>>
>>>> For me the calendar link did not work on mobile, but I was able to add
>>>> the dev Google calendar from
>>>> https://iceberg.apache.org/community/#iceberg-community-events by
>>>> accessing it from a laptop.
>>>>
>>>> Regards,
>>>> Himadri Pal
>>>>
>>>>
>>>> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Jack! I think the images are stripped from the message, but
>>>>> they are there on the doc
>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>> if someone wants to check them out (I have left some comments while there).
>>>>>
>>>>> Also I no longer see the community sync calendar
>>>>> https://iceberg.apache.org/community/#slack, so it is unclear when
>>>>> the meeting is (and we do not have the link).
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <[email protected]> wrote:
>>>>>
>>>>>> Thanks Jan! +1 for everyone to take a look before the discussion, and
>>>>>> see if there are any missing options or major arguments.
>>>>>>
>>>>>> I have also added images for all the options, which might be
>>>>>> easier to parse than the big sheet. I will also put them here for people
>>>>>> who do not have time to read through it:
>>>>>>
>>>>>>
>>>>>> *Option 1: Add storage table identifier in view metadata content*
>>>>>>
>>>>>> [image: MV option 1.png]
>>>>>> *Option 2: Add storage table metadata file pointer in view object*
>>>>>>
>>>>>> [image: MV option 2.png]
>>>>>> *Option 3: Add storage table metadata file pointer in view metadata
>>>>>> content*
>>>>>>
>>>>>> [image: MV option 3.png]
>>>>>>
>>>>>> *Option 4: Embed table metadata in view metadata content*
>>>>>>
>>>>>> [image: MV option 4.png]
>>>>>> *Option 5: New MV spec, MV object has table and view metadata file
>>>>>> pointers*
>>>>>>
>>>>>> [image: MV option 5.png]
>>>>>> *Option 6: New MV spec, MV metadata content embeds table and view
>>>>>> metadata*
>>>>>>
>>>>>> [image: MV option 6.png]
>>>>>> *Option 7: New MV spec, completely new MV metadata content*
>>>>>>
>>>>>> [image: MV option 7.png]
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I think it's great to have a face to face discussion about this.
>>>>>>> Additionally, I would propose to use Jack's document
>>>>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>>>>> as a common ground for the discussion and that everyone has a quick look
>>>>>>> before the next community sync. If you think the document is still
>>>>>>> missing some arguments, please make suggestions to add them. This way we
>>>>>>> spend less time getting everyone up to speed and share a more common
>>>>>>> terminology.
>>>>>>>
>>>>>>> Looking forward to the discussion, best wishes
>>>>>>>
>>>>>>> Jan
>>>>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>>>>
>>>>>>> The calendar on the site is currently broken
>>>>>>> https://iceberg.apache.org/community/#iceberg-community-events.
>>>>>>> Might help to fix it or share the meeting link here.
>>>>>>>
>>>>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <[email protected]> wrote:
>>>>>>>
>>>>>>>> Sounds good, let's discuss this in person!
>>>>>>>>
>>>>>>>> I am a bit worried that we have quite a few critical topics going
>>>>>>>> on right now on the dev list, and this one will take up a lot of time to
>>>>>>>> discuss. If it ends up going on for too long, I propose we hold a dedicated
>>>>>>>> meeting, and I am more than happy to organize it.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> I think this thread has hit a point of diminishing returns and
>>>>>>>>> that we still don't have a common understanding of what the options
>>>>>>>>> under consideration actually are.
>>>>>>>>>
>>>>>>>>> Since we were already planning on discussing this at the next
>>>>>>>>> community sync, I suggest we pick this up there and use that time to
>>>>>>>>> align on what exactly we're considering. We can then start a new thread to
>>>>>>>>> lay out the designs under consideration in more detail and then have a
>>>>>>>>> discussion about trade-offs.
>>>>>>>>>
>>>>>>>>> Does that sound reasonable?
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>>>>>> also suggest breaking the expectation/outcome into milestones. Maybe it
>>>>>>>>>> becomes easier if we agree to distinguish between an approach that is
>>>>>>>>>> feasible in the near term and another in the long term, especially
>>>>>>>>>> if the latter requires significant engine-side changes.
>>>>>>>>>>
>>>>>>>>>> Further, maybe it helps if we start with an option that fully
>>>>>>>>>> reuses the existing spec, and see how we view it in comparison with
>>>>>>>>>> the options discussed previously. I am sharing one below. It reuses the
>>>>>>>>>> current spec of Iceberg views and tables by leveraging table properties to
>>>>>>>>>> capture materialized view metadata. What is common (and not common) between
>>>>>>>>>> this and the desired representations?
>>>>>>>>>>
>>>>>>>>>> The new properties are:
>>>>>>>>>> Properties on a View:
>>>>>>>>>>
>>>>>>>>>>    1. *iceberg.materialized.view*:
>>>>>>>>>>       - *Type*: View property
>>>>>>>>>>       - *Purpose*: This property is used to mark whether a view
>>>>>>>>>>       is a materialized view. If set to true, the view is
>>>>>>>>>>       treated as a materialized view. This helps in differentiating
>>>>>>>>>>       between virtual and materialized views within the catalog and
>>>>>>>>>>       dictates specific handling and validation logic for materialized views.
>>>>>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>>>>>       - *Type*: View property
>>>>>>>>>>       - *Purpose*: Specifies the location of the storage table
>>>>>>>>>>       associated with the materialized view. This property is used
>>>>>>>>>>       for linking a materialized view with its corresponding storage
>>>>>>>>>>       table, enabling data management and query execution based on the
>>>>>>>>>>       stored data freshness.
>>>>>>>>>>
>>>>>>>>>> Properties on a Table:
>>>>>>>>>>
>>>>>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>>>>>       - *Type*: Table property
>>>>>>>>>>       - *Purpose*: These properties store the snapshot IDs of
>>>>>>>>>>       the base tables at the time the materialized view's data was
>>>>>>>>>>       last updated. Each property is prefixed with base.snapshot. followed by
>>>>>>>>>>       the UUID of the base table. They are used to track whether the
>>>>>>>>>>       materialized view's data is up to date with the base tables by
>>>>>>>>>>       comparing these snapshot IDs with the current snapshot IDs of the
>>>>>>>>>>       base tables. If all the base tables' current snapshot IDs match the
>>>>>>>>>>       ones stored in these properties, the materialized view's data is
>>>>>>>>>>       considered fresh.
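
The freshness rule described above could be sketched roughly as follows. This is
only an illustration, not Iceberg API: the class MvFreshnessSketch, the method
isFresh, and the two map arguments are hypothetical names, and the handling of
missing entries (an unknown base table or no base.snapshot.* property at all is
treated as stale) is an assumption the proposal does not spell out.

```java
import java.util.Map;

final class MvFreshnessSketch {
    // Property prefix taken from the proposal; the suffix is the base table's UUID.
    static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

    /**
     * Returns true when every base.snapshot.[UUID] property on the storage table
     * equals the current snapshot ID of the base table with that UUID.
     */
    static boolean isFresh(Map<String, String> storageTableProperties,
                           Map<String, Long> currentSnapshotIdByTableUuid) {
        boolean sawBaseSnapshot = false;
        for (Map.Entry<String, String> property : storageTableProperties.entrySet()) {
            String key = property.getKey();
            if (!key.startsWith(BASE_SNAPSHOT_PREFIX)) {
                continue; // unrelated table property
            }
            sawBaseSnapshot = true;
            String uuid = key.substring(BASE_SNAPSHOT_PREFIX.length());
            Long current = currentSnapshotIdByTableUuid.get(uuid);
            // Assumption: a base table we cannot resolve counts as stale.
            if (current == null || !String.valueOf(current).equals(property.getValue())) {
                return false;
            }
        }
        // Assumption: no recorded base snapshots means "never refreshed", so not fresh.
        return sawBaseSnapshot;
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
            "base.snapshot.uuid-a", "101",
            "base.snapshot.uuid-b", "202",
            "write.format.default", "parquet");
        System.out.println(isFresh(props, Map.of("uuid-a", 101L, "uuid-b", 202L))); // true
        System.out.println(isFresh(props, Map.of("uuid-a", 101L, "uuid-b", 203L))); // false
    }
}
```

In this sketch the engine would still need to load every base table to learn its
current snapshot ID, which is the cost the freshness check carries regardless of
where the properties are stored.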
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> > All of these approaches are aligned in one, specific way: the
>>>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>>>
>>>>>>>>>>> I do not think that is true. I think people are aligned that we
>>>>>>>>>>> would like to re-use the Iceberg table metadata defined in the
>>>>>>>>>>> Iceberg table spec to express the data in an MV, but I don't think it goes
>>>>>>>>>>> so far as to say it must be an Iceberg table. Once you have that mindset, then
>>>>>>>>>>> of course option 1 (separate table and view) is the only option.
>>>>>>>>>>>
>>>>>>>>>>> > I don't think that is necessary and it significantly increases
>>>>>>>>>>> the complexity.
>>>>>>>>>>>
>>>>>>>>>>> And can you quantify what you mean by "significantly increases
>>>>>>>>>>> the complexity"? It seems like a lot of the concerns come from the
>>>>>>>>>>> tradeoff with complexity. We probably all agree that using option 7 (a
>>>>>>>>>>> completely new metadata type) is a lot of work from scratch, which is why
>>>>>>>>>>> it is not favored. However, my understanding is that as long as we re-use
>>>>>>>>>>> the view and table metadata, the majority of the existing logic can be
>>>>>>>>>>> reused. I think what we have gone through in Slack to draft the rough Java
>>>>>>>>>>> API shape helps here, because people can estimate the amount of effort
>>>>>>>>>>> required to implement it. And I don't think they are **significantly**
>>>>>>>>>>> more complex to implement. Could you elaborate more about the complexity
>>>>>>>>>>> that you imagine?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I feel I've been most vocal about pushing back against options
>>>>>>>>>>>> 2+ (or Ryan's categories of combined table/view, or new metadata 
>>>>>>>>>>>> type), so
>>>>>>>>>>>> I'll try to expand on my reasoning.
>>>>>>>>>>>>
>>>>>>>>>>>> I understand the appeal of creating a design where we
>>>>>>>>>>>> encapsulate the view/storage from both a structural and performance
>>>>>>>>>>>> standpoint, but I don't think that is necessary and it
>>>>>>>>>>>> significantly increases the complexity.
>>>>>>>>>>>>
>>>>>>>>>>>> All of these approaches are aligned in one, specific way: the
>>>>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>>>>
>>>>>>>>>>>> Because of this, all the behaviors and requirements still apply
>>>>>>>>>>>> to these tables.  They need to be maintained (snapshot cleanup,
>>>>>>>>>>>> orphan files), in some cases need to be optimized (compaction, manifest
>>>>>>>>>>>> rewrites), and they need to be inspectable (this will be even more
>>>>>>>>>>>> important with MVs, since staleness can produce different results and
>>>>>>>>>>>> questions will arise about what state the storage table was in).  There
>>>>>>>>>>>> may be cases where the tables need to be managed directly.
>>>>>>>>>>>>
>>>>>>>>>>>> Anywhere we deviate from the existing constructs/commit/access
>>>>>>>>>>>> for tables, we will ultimately have to then unwrap to re-expose the
>>>>>>>>>>>> underlying Iceberg behavior.  This creates unnecessary complexity
>>>>>>>>>>>> in the library/API layer, which is not the primary interface users will
>>>>>>>>>>>> have with materialized views, since an engine is almost always necessary
>>>>>>>>>>>> to interact with the dataset.
>>>>>>>>>>>>
>>>>>>>>>>>> As to the performance concerns around option 1, I think we're
>>>>>>>>>>>> overstating the downsides.  It really comes down to how many
>>>>>>>>>>>> metadata loads are necessary and evaluating freshness would likely be the
>>>>>>>>>>>> real bottleneck as it involves potentially loading many tables.  All of the
>>>>>>>>>>>> options are on the same order of performance for the metadata and table loads.
>>>>>>>>>>>>
>>>>>>>>>>>> As to the visibility of tables and whether they're registered
>>>>>>>>>>>> in the catalog, I think registering in the catalog is the right
>>>>>>>>>>>> approach so that the tables are still addressable for maintenance/etc.  The
>>>>>>>>>>>> visibility of the storage table is a catalog implementation decision and
>>>>>>>>>>>> shouldn't be a requirement of the MV spec (I can see cases for both and it
>>>>>>>>>>>> isn't necessary to dictate a behavior).
>>>>>>>>>>>>
>>>>>>>>>>>> I'm still strongly in favor of Option 1 (separate table and
>>>>>>>>>>>> view) for these reasons.
>>>>>>>>>>>>
>>>>>>>>>>>> -Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined
>>>>>>>>>>>>> table and view (rather than a new metadata spec for a 
>>>>>>>>>>>>> materialized view).
>>>>>>>>>>>>> What is the main motivation? It seems like you’re convinced of 
>>>>>>>>>>>>> that
>>>>>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the options
>>>>>>>>>>>>> we have discussed so far. I wanted to use the existing Google
>>>>>>>>>>>>> Doc, but it has really bad table/sheet support...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have listed all the options, with how they are implemented
>>>>>>>>>>>>> and some important considerations we have discussed so far. Note 
>>>>>>>>>>>>> that:
>>>>>>>>>>>>> 1. This sheet currently excludes the lineage information,
>>>>>>>>>>>>> which we can discuss more later after the current topic is 
>>>>>>>>>>>>> resolved.
>>>>>>>>>>>>> 2. I removed the considerations for REST integration since
>>>>>>>>>>>>> from the other thread we have clarified that they should be 
>>>>>>>>>>>>> considered
>>>>>>>>>>>>> completely separately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Why I come as a proponent of having a new MV object with
>>>>>>>>>>>>> table and view metadata file pointer*
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my sheet, there are 3 options that do not have major
>>>>>>>>>>>>> problems:
>>>>>>>>>>>>> Option 2: Add storage table metadata file pointer in view
>>>>>>>>>>>>> object
>>>>>>>>>>>>> Option 5: New MV object with table and view metadata file
>>>>>>>>>>>>> pointer
>>>>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>>>>
>>>>>>>>>>>>> I originally excluded option 2 because I thought it did not
>>>>>>>>>>>>> align with the REST spec, but after the other discussion thread
>>>>>>>>>>>>> about "Inconsistency between REST spec and table/view spec", I think my
>>>>>>>>>>>>> original concern no longer holds, so I have put it back. Based on my
>>>>>>>>>>>>> personal preference that an MV is an independent object that should
>>>>>>>>>>>>> be separate from view and table, plus the fact that option 5 is
>>>>>>>>>>>>> probably less work to implement than option 6, I am a
>>>>>>>>>>>>> proponent of option 5 at this moment.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>>>>>>>> framework. That framework's categorization puts options 2, 3, 4, 5,
>>>>>>>>>>>>> and 6 all under the same category of "A combination of a view and a
>>>>>>>>>>>>> table" and concludes that they don't have any advantage for the
>>>>>>>>>>>>> same set of reasons. But those reasons are not really convincing to me, so
>>>>>>>>>>>>> let's talk about them in more detail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) You said "I don’t see a reason why a combined view and
>>>>>>>>>>>>> table is advantageous" as "this would cause unnecessary
>>>>>>>>>>>>> dependence between the view and table in catalogs."  What dependency
>>>>>>>>>>>>> exactly do you mean here? And why is that unnecessary, given there has
>>>>>>>>>>>>> to be some sort of dependency anyway unless we go with option 5 or 6?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (2) You said "I guess there’s an argument that you could load
>>>>>>>>>>>>> both table and view metadata locations at the same time. That
>>>>>>>>>>>>> hardly seems worth the trouble". I disagree with that. Catalog
>>>>>>>>>>>>> interaction performance is critical to at least everyone working in EMR
>>>>>>>>>>>>> and Athena, and MV itself as an acceleration approach needs to be as
>>>>>>>>>>>>> fast as possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have put 3 key operations in the doc that I think matter
>>>>>>>>>>>>> for an MV during interactions with an engine:
>>>>>>>>>>>>> 1. refresh the storage table
>>>>>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>>>>>
>>>>>>>>>>>>> And option 1 clearly falls short, with 4 sequential steps
>>>>>>>>>>>>> required to load a storage table. You mentioned "recent issues
>>>>>>>>>>>>> with adding views to the JDBC catalog" in this topic, could you explain
>>>>>>>>>>>>> a bit more?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (3) You said "I also think that once we decide on structure,
>>>>>>>>>>>>> we can make it possible for REST catalog implementations to do
>>>>>>>>>>>>> smart things, in a way that doesn’t put additional requirements on the
>>>>>>>>>>>>> underlying catalog store." If REST is fully compatible with Iceberg spec
>>>>>>>>>>>>> then I have no problem with this statement. However, as we discussed in
>>>>>>>>>>>>> the other thread, it is not the case. In the current state, I think the
>>>>>>>>>>>>> sequence of action should be to evolve the Iceberg table/view spec (or
>>>>>>>>>>>>> add a MV spec) first, and then think about how REST can incorporate it
>>>>>>>>>>>>> or do smart things that are not Iceberg spec compliant. Do you agree
>>>>>>>>>>>>> with that?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (4) You said the table identifier pointer "is a problem we
>>>>>>>>>>>>> need to solve generally because a materialized table needs to be
>>>>>>>>>>>>> able to track the upstream state of tables that were used". I don't
>>>>>>>>>>>>> think that is a reason to choose to use a table identifier pointer for a
>>>>>>>>>>>>> storage table. The issue is not about using a table identifier pointer.
>>>>>>>>>>>>> It is about exposing the storage table as a separate entity in the
>>>>>>>>>>>>> catalog, which is what people do not like and is already discussed at
>>>>>>>>>>>>> length in Jan's question 3 (also linked in the sheet). I agree with that
>>>>>>>>>>>>> statement, because without a REST implementation that can magically hide
>>>>>>>>>>>>> the storage table, this model adds additional burden regarding
>>>>>>>>>>>>> compliance and data governance for any other non-REST catalog
>>>>>>>>>>>>> implementations that are compliant with the Iceberg spec. Many
>>>>>>>>>>>>> mechanisms need to be built in a catalog to hide, protect, maintain,
>>>>>>>>>>>>> and recycle the storage table, which can be avoided by using other
>>>>>>>>>>>>> approaches. I think we should reach a consensus about that and discuss
>>>>>>>>>>>>> further if you do not agree.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ryan, we actually discussed your categories in this
>>>>>>>>>>>>>> question
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>.
>>>>>>>>>>>>>> Where your categories correspond to the following designs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories,
>>>>>>>>>>>>>> so I’ll be more specific:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - *Separate table and view*: this option is to have the
>>>>>>>>>>>>>>    objects that we have today, with extra metadata. Commit
>>>>>>>>>>>>>>    processes are separate: committing to the table doesn’t alter the view
>>>>>>>>>>>>>>    and committing to the view doesn’t change the table. However, changing
>>>>>>>>>>>>>>    the view can make it so the table is no longer useful as a
>>>>>>>>>>>>>>    materialization.
>>>>>>>>>>>>>>    - *A combination of a view and a table*: in this option,
>>>>>>>>>>>>>>    the table metadata and view metadata are the same as the
>>>>>>>>>>>>>>    first option. The difference is that the commit process combines them,
>>>>>>>>>>>>>>    either by embedding a table metadata location in view metadata or by
>>>>>>>>>>>>>>    tracking both in the same catalog reference.
>>>>>>>>>>>>>>    - *A new metadata type*: this option is where we define a
>>>>>>>>>>>>>>    new metadata object that has view attributes, like SQL
>>>>>>>>>>>>>>    representations, along with table attributes, like partition specs and
>>>>>>>>>>>>>>    snapshots.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hopefully this is clear because I think much of the confusion
>>>>>>>>>>>>>> is caused by different definitions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The LoadTableResponse having optional metadata-location field
>>>>>>>>>>>>>> implies that the object in the catalog no longer needs to hold a 
>>>>>>>>>>>>>> metadata
>>>>>>>>>>>>>> file pointer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The REST protocol has not removed the requirement for a
>>>>>>>>>>>>>> metadata file, so I’m going to keep focused on the MV design 
>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When we say a MV can be a “new metadata type”, it does not
>>>>>>>>>>>>>> mean it needs to define a completely brand new structure of the 
>>>>>>>>>>>>>> metadata
>>>>>>>>>>>>>> content
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’m making a distinction between separate metadata files for
>>>>>>>>>>>>>> the table and the view and a combined metadata object, as above.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can define an “Iceberg MV” to be an object in a catalog,
>>>>>>>>>>>>>> which has 1 table metadata file pointer, and 1 view metadata 
>>>>>>>>>>>>>> file pointer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is the option I am referring to as a “combination of a
>>>>>>>>>>>>>> view and a table”.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So to review my initial email, I don’t see a reason why a
>>>>>>>>>>>>>> combined view and table is advantageous, either implemented by
>>>>>>>>>>>>>> having a catalog reference with two metadata locations or embedding a
>>>>>>>>>>>>>> table metadata location in view metadata. This would cause unnecessary
>>>>>>>>>>>>>> dependence between the view and table in catalogs. I guess there’s an
>>>>>>>>>>>>>> argument that you could load both table and view metadata locations at
>>>>>>>>>>>>>> the same time. That hardly seems worth the trouble given the recent
>>>>>>>>>>>>>> issues with adding views to the JDBC catalog.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>>>>>>>>> possible for REST catalog implementations to do smart things, in
>>>>>>>>>>>>>> a way that doesn’t put additional requirements on the underlying catalog
>>>>>>>>>>>>>> store. For instance, we could specify how to send additional objects in a
>>>>>>>>>>>>>> LoadViewResult, in case the catalog wants to pre-fetch table
>>>>>>>>>>>>>> metadata. I think these optimizations are a later addition, after we
>>>>>>>>>>>>>> define the relationship between views and tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>>>>>> and view (rather than a new metadata spec for a materialized 
>>>>>>>>>>>>>> view). What is
>>>>>>>>>>>>>> the main motivation? It seems like you’re convinced of that 
>>>>>>>>>>>>>> approach, but I
>>>>>>>>>>>>>> don’t understand the advantage it brings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few
>>>>>>>>>>>>>>> minor points.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new 
>>>>>>>>>>>>>>>> metadata type?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial
>>>>>>>>>>>>>>> proposal of a new Catalog MV object that has two references 
>>>>>>>>>>>>>>> (ViewMetadata +
>>>>>>>>>>>>>>> TableMetadata).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Regular views are separate, rather than being tables
>>>>>>>>>>>>>>>>    with SQL and no data so it would be inconsistent (“Iceberg 
>>>>>>>>>>>>>>>> view is just a
>>>>>>>>>>>>>>>>    table with no data but with representations defined. But we 
>>>>>>>>>>>>>>>> did not do
>>>>>>>>>>>>>>>>    that.”)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Tables are not typically exposed to end users — but
>>>>>>>>>>>>>>>>    this isn’t required by the separate view and table option
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack
>>>>>>>>>>>>>>>    says it is a spec change (ie, to catalogs)
>>>>>>>>>>>>>>>    - A single call to get the View's StorageTable (versus
>>>>>>>>>>>>>>>    two calls)
>>>>>>>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Thoughts:  *I think the long discussion sessions we had on
>>>>>>>>>>>>>>> Slack were fruitful for me, as seeing the API clarified some
>>>>>>>>>>>>>>> things.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I was initially more in favor of MV being a new metadata
>>>>>>>>>>>>>>> type (TableMetadata + ViewMetadata).  But seeing most of the MV
>>>>>>>>>>>>>>> operations end up being ViewCatalog or Catalog operations, I am starting
>>>>>>>>>>>>>>> to think API-wise that it may not align with the new metadata type
>>>>>>>>>>>>>>> (unless we define MVCatalog and /MV REST endpoints, which then are
>>>>>>>>>>>>>>> boilerplate wrappers).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Initially one question I had for option 'a view and a
>>>>>>>>>>>>>>> separate table', was how to make this table reference
>>>>>>>>>>>>>>> (metadata.json or catalog reference).  In the previous option, we had a
>>>>>>>>>>>>>>> precedent of Catalog references to Metadata, but not pointers between
>>>>>>>>>>>>>>> Metadatas.  I initially saw the proposed Catalog's TableIdentifier
>>>>>>>>>>>>>>> pointer as 'polluting' catalog concerns in ViewMetadata.  (I saw Catalog
>>>>>>>>>>>>>>> and ViewCatalog as a layer above TableMetadata and ViewMetadata).  But I
>>>>>>>>>>>>>>> think Dan in the Slack made a fair point that ViewMetadata already is
>>>>>>>>>>>>>>> tightly bound with a Catalog.  In this case, I think this approach does
>>>>>>>>>>>>>>> have its merits as well in aligning Catalog APIs with the metadata.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to provide my perspective on the question of
>>>>>>>>>>>>>>>> what a materialized view is and elaborate on Jack's recent 
>>>>>>>>>>>>>>>> proposal to view
>>>>>>>>>>>>>>>> a materialized view as a catalog concept.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every
>>>>>>>>>>>>>>>> entity in the catalog has a *unique identifier*, and the
>>>>>>>>>>>>>>>> catalog provides methods to create, load, and update these 
>>>>>>>>>>>>>>>> entities. An
>>>>>>>>>>>>>>>> important thing to note is that the catalog methods exhibit 
>>>>>>>>>>>>>>>> two different
>>>>>>>>>>>>>>>> behaviors: the *create and load methods deal with the
>>>>>>>>>>>>>>>> entire entity*, while the *update(commit) method only
>>>>>>>>>>>>>>>> deals with partial changes* to the entities.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the context of our current discussion, materialized view
>>>>>>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact
>>>>>>>>>>>>>>>> that the update method deals only with partial changes enables us to
>>>>>>>>>>>>>>>> *reuse the existing methods for updating tables and views*. For
>>>>>>>>>>>>>>>> updates we don't have to define what constitutes an entire
>>>>>>>>>>>>>>>> materialized view. Changes to a materialized view targeting the
>>>>>>>>>>>>>>>> properties related to the view metadata could use the update(commit)
>>>>>>>>>>>>>>>> view method. Similarly, changes targeting the properties related to the
>>>>>>>>>>>>>>>> table metadata could use the update(commit) table method. This is great
>>>>>>>>>>>>>>>> news because we don't have to redefine view and table commits
>>>>>>>>>>>>>>>> (requirements, updates).
>>>>>>>>>>>>>>>> This is shown in the fact that Jack uses the same operation
>>>>>>>>>>>>>>>> to update the storage table for Option 1 and 3:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The open question is *whether the create and load methods
>>>>>>>>>>>>>>>> should treat the properties that constitute the MV metadata as 
>>>>>>>>>>>>>>>> two entities
>>>>>>>>>>>>>>>> (View + Table) or one entity (new MV object)*. This is all
>>>>>>>>>>>>>>>> part of Jack's proposal, where Option 1 proposes a new MV 
>>>>>>>>>>>>>>>> object, and
>>>>>>>>>>>>>>>> Option 3 proposes two separate entities. The advantage of 
>>>>>>>>>>>>>>>> Option 1 is that
>>>>>>>>>>>>>>>> it doesn't require two operations to load the metadata. On the 
>>>>>>>>>>>>>>>> other hand,
>>>>>>>>>>>>>>>> the advantage of Option 3 is that no new operations or 
>>>>>>>>>>>>>>>> catalogs have to be
>>>>>>>>>>>>>>>> defined.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In my opinion, defining a new representation for
>>>>>>>>>>>>>>>> materialized views (Option 1) is generally the cleaner 
>>>>>>>>>>>>>>>> solution. However, I
>>>>>>>>>>>>>>>> see a path where we could first introduce Option 3 and still 
>>>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>> possibility to transition to Option 1 if needed. The great 
>>>>>>>>>>>>>>>> thing about
>>>>>>>>>>>>>>>> Option 3 is that it only requires minor changes to the current 
>>>>>>>>>>>>>>>> spec and is
>>>>>>>>>>>>>>>> mostly implementation detail.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Therefore, I would propose small additions to Jack's Option 3
>>>>>>>>>>>>>>>> that only introduce changes to the spec that are not specific 
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> materialized views. The idea is to introduce boolean 
>>>>>>>>>>>>>>>> properties to be set
>>>>>>>>>>>>>>>> on the creation of the view and the storage table that 
>>>>>>>>>>>>>>>> indicate that they
>>>>>>>>>>>>>>>> belong to a materialized view. The view property 
>>>>>>>>>>>>>>>> "materialized" is set to
>>>>>>>>>>>>>>>> "true" for a MV and "false" for a regular view. And the table 
>>>>>>>>>>>>>>>> property
>>>>>>>>>>>>>>>> "storage_table" is set to "true" for a storage table and 
>>>>>>>>>>>>>>>> "false" for a
>>>>>>>>>>>>>>>> regular table. The absence of these properties indicates a 
>>>>>>>>>>>>>>>> regular view or
>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We could then introduce a new requirement for views and
>>>>>>>>>>>>>>>> tables called "AssertProperty", which would ensure that only
>>>>>>>>>>>>>>>> updates in line with materialized views are performed. The
>>>>>>>>>>>>>>>> additional requirement can be seen as a general extension that
>>>>>>>>>>>>>>>> does not need to be changed if we decide to go with Option 1 in
>>>>>>>>>>>>>>>> the future.
>>>>>>>>>>>>>>>>
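To make the "AssertProperty" idea concrete, here is a minimal sketch in plain Java of how such a requirement might be checked before a commit is applied. It is only an illustration under assumptions: the AssertProperty class, its isSatisfiedBy and validate methods, and the choice to treat a missing property as "false" are hypothetical and are not part of the Iceberg API or the REST spec. Only the property names "materialized" and "storage_table" come from the proposal above.

```java
import java.util.Map;

// Hypothetical sketch: this type does not exist in Iceberg today.
final class AssertProperty {
    private final String key;
    private final String expected;

    AssertProperty(String key, String expected) {
        this.key = key;
        this.expected = expected;
    }

    // A missing property is treated as "false", following the rule that
    // absence of the property indicates a regular view or table.
    boolean isSatisfiedBy(Map<String, String> properties) {
        return expected.equals(properties.getOrDefault(key, "false"));
    }

    // Rejects the commit when the current properties do not match.
    void validate(Map<String, String> properties) {
        if (!isSatisfiedBy(properties)) {
            throw new IllegalStateException(
                "Requirement failed: property '" + key + "' must be '" + expected + "'");
        }
    }
}
```

For example, an update meant for a storage table could carry new AssertProperty("storage_table", "true"), so the commit is rejected if the target turns out to be a regular table. Because the check only looks at a key and a value, nothing in it is specific to materialized views.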
>>>>>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>>>>>>>>> metadata definitions and minimizing spec changes are very 
>>>>>>>>>>>>>>>> important. This
>>>>>>>>>>>>>>>> also minimizes spec drift (between materialized views and 
>>>>>>>>>>>>>>>> views spec, and
>>>>>>>>>>>>>>>> between materialized views and tables spec), and simplifies the
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In an effort to take the discussion forward with concrete
>>>>>>>>>>>>>>>> design options based on an end-to-end implementation, I have 
>>>>>>>>>>>>>>>> prototyped the
>>>>>>>>>>>>>>>> implementation (and added Spark support) in this PR
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it
>>>>>>>>>>>>>>>> helps us reach convergence faster. More details about some of 
>>>>>>>>>>>>>>>> the design
>>>>>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I mean separate table and view metadata that is somehow
>>>>>>>>>>>>>>>>> combined through a commit process. For instance, keeping a 
>>>>>>>>>>>>>>>>> pointer to a
>>>>>>>>>>>>>>>>> table metadata file in a view metadata file or combining 
>>>>>>>>>>>>>>>>> commits to
>>>>>>>>>>>>>>>>> reference both. I don't see the value in either option.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root
>>>>>>>>>>>>>>>>>> question! Just a clarification question regarding your reply 
>>>>>>>>>>>>>>>>>> before I reply
>>>>>>>>>>>>>>>>>> further: what exactly does the option "a combination of the 
>>>>>>>>>>>>>>>>>> two (i.e.
>>>>>>>>>>>>>>>>>> commits are combined)" mean? How is that different from "a 
>>>>>>>>>>>>>>>>>> new metadata
>>>>>>>>>>>>>>>>>> type"?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
