Re: Materialized view integration with REST spec

Renjie Liu Thu, 21 Mar 2024 20:30:10 -0700

Hi:

Sorry I couldn't join the last community sync. Did we reach any
conclusion about the MV spec?


On Tue, Mar 5, 2024 at 11:28 PM himadri pal <[email protected]> wrote:

> For me the calendar link did not work on mobile, but I was able to add the
> dev Google calendar from
> https://iceberg.apache.org/community/#iceberg-community-events by
> accessing it from a laptop.
>
> Regards,
> Himadri Pal
>
>
> On Mon, Mar 4, 2024 at 4:43 PM Walaa Eldin Moustafa <[email protected]>
> wrote:
>
>> Thanks Jack! I think the images are stripped from the message, but they
>> are there on the doc
>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>> if someone wants to check them out (I have left some comments while there).
>>
>> Also I no longer see the community sync calendar
>> https://iceberg.apache.org/community/#slack, so it is unclear when the
>> meeting is (and we do not have the link).
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Mon, Mar 4, 2024 at 9:58 AM Jack Ye <[email protected]> wrote:
>>
>>> Thanks Jan! +1 for everyone to take a look before the discussion, and
>>> see if there are any missing options or major arguments.
>>>
>>> I have also added the images for all the options; they might be
>>> easier to parse than the big sheet. I will also put them here for people
>>> who do not have time to read through it:
>>>
>>>
>>> *Option 1: Add storage table identifier in view metadata content*
>>>
>>> [image: MV option 1.png]
>>> *Option 2: Add storage table metadata file pointer in view object*
>>>
>>> [image: MV option 2.png]
>>> *Option 3: Add storage table metadata file pointer in view metadata
>>> content*
>>>
>>> [image: MV option 3.png]
>>>
>>> *Option 4: Embed table metadata in view metadata content*
>>>
>>> [image: MV option 4.png]
>>> *Option 5: New MV spec, MV object has table and view metadata file
>>> pointers*
>>>
>>> [image: MV option 5.png]
>>> *Option 6: New MV spec, MV metadata content embeds table and view
>>> metadata*
>>>
>>> [image: MV option 6.png]
>>> *Option 7: New MV spec, completely new MV metadata content*
>>>
>>> [image: MV option 7.png]
>>>
>>> -Jack
>>>
>>>
>>> On Sun, Mar 3, 2024 at 11:45 PM Jan Kaul <[email protected]>
>>> wrote:
>>>
>>>> I think it's great to have a face to face discussion about this.
>>>> Additionally, I would propose to use Jack's document
>>>> <https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0>
>>>> as a common ground for the discussion and that everyone has a quick look
>>>> before the next community sync. If you think the document is still missing
>>>> some arguments, please make suggestions to add them. This way we have to
>>>> spend less time to get everyone up to speed and have a more common
>>>> terminology.
>>>>
>>>> Looking forward to the discussion, best wishes
>>>>
>>>> Jan
>>>> On 02.03.24 02:06, Walaa Eldin Moustafa wrote:
>>>>
>>>> The calendar on the site is currently broken
>>>> https://iceberg.apache.org/community/#iceberg-community-events. Might
>>>> help to fix it or share the meeting link here.
>>>>
>>>> On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <[email protected]> wrote:
>>>>
>>>>> Sounds good, let's discuss this in person!
>>>>>
>>>>> I am a bit worried that we have quite a few critical topics going on
>>>>> right now on devlist, and this will take up a lot of time to discuss. If
>>>>> it ends up going for too long, I propose that we have a dedicated
>>>>> meeting, and I am more than happy to organize it.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I think this thread has hit a point of diminishing returns and that
>>>>>> we still don't have a common understanding of what the options under
>>>>>> consideration actually are.
>>>>>>
>>>>>> Since we were already planning on discussing this at the next
>>>>>> community sync, I suggest we pick this up there and use that time to
>>>>>> align on what exactly we're considering. We can then start a new thread
>>>>>> to lay out the designs under consideration in more detail and then have
>>>>>> a discussion about trade-offs.
>>>>>>
>>>>>> Does that sound reasonable?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I am finding it hard to interpret the options concretely. I would
>>>>>>> also suggest breaking the expectation/outcome into milestones. Maybe it
>>>>>>> becomes easier if we agree to distinguish between an approach that is
>>>>>>> feasible in the near term and another in the long term, especially if
>>>>>>> the latter requires significant engine-side changes.
>>>>>>>
>>>>>>> Further, maybe it helps if we start with an option that fully reuses
>>>>>>> the existing spec, and see how we view it in comparison with the options
>>>>>>> discussed previously. I am sharing one below. It reuses the current
>>>>>>> spec of Iceberg views and tables by leveraging table properties to
>>>>>>> capture materialized view metadata. What is common (and not common)
>>>>>>> between this and the desired representations?
>>>>>>>
>>>>>>> The new properties are:
>>>>>>>
>>>>>>> Properties on a View:
>>>>>>>
>>>>>>>    1. *iceberg.materialized.view*:
>>>>>>>       - *Type*: View property
>>>>>>>       - *Purpose*: This property is used to mark whether a view is
>>>>>>>       a materialized view. If set to true, the view is treated as a
>>>>>>>       materialized view. This helps in differentiating between virtual
>>>>>>>       and materialized views within the catalog and dictates specific
>>>>>>>       handling and validation logic for materialized views.
>>>>>>>    2. *iceberg.materialized.view.storage.location*:
>>>>>>>       - *Type*: View property
>>>>>>>       - *Purpose*: Specifies the location of the storage table
>>>>>>>       associated with the materialized view. This property is used for
>>>>>>>       linking a materialized view with its corresponding storage table,
>>>>>>>       enabling data management and query execution based on the stored
>>>>>>>       data freshness.
>>>>>>>
>>>>>>> Properties on a Table:
>>>>>>>
>>>>>>>    1. *base.snapshot.[UUID]*:
>>>>>>>       - *Type*: Table property
>>>>>>>       - *Purpose*: These properties store the snapshot IDs of the
>>>>>>>       base tables at the time the materialized view's data was last
>>>>>>>       updated. Each property is prefixed with base.snapshot. followed
>>>>>>>       by the UUID of the base table. They are used to track whether the
>>>>>>>       materialized view's data is up to date with the base tables by
>>>>>>>       comparing these snapshot IDs with the current snapshot IDs of the
>>>>>>>       base tables. If all the base tables' current snapshot IDs match
>>>>>>>       the ones stored in these properties, the materialized view's data
>>>>>>>       is considered fresh.
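The freshness rule described for base.snapshot.[UUID] can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (MvFreshness, isFresh) are hypothetical, the map of current base-table snapshot IDs is assumed to be supplied by the engine or catalog, and treating a storage table with no recorded base snapshots as "not fresh" is a choice made here, not something the proposal specifies.

```java
import java.util.Map;

/** Illustrative sketch only; not part of the Iceberg API. */
final class MvFreshness {
  // Property prefix taken from the proposal: base.snapshot.[UUID]
  static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

  private MvFreshness() {}

  /**
   * Returns true when every base.snapshot.* property on the storage table
   * matches the current snapshot ID of the corresponding base table.
   *
   * @param storageTableProps  properties of the storage table
   * @param currentSnapshotIds base table UUID -> current snapshot ID
   *                           (hypothetical input; the lookup is left to the caller)
   */
  static boolean isFresh(Map<String, String> storageTableProps,
                         Map<String, Long> currentSnapshotIds) {
    boolean sawBaseTable = false;
    for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
      if (!entry.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
        continue; // unrelated table property
      }
      sawBaseTable = true;
      String baseTableUuid = entry.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
      Long current = currentSnapshotIds.get(baseTableUuid);
      long recorded;
      try {
        recorded = Long.parseLong(entry.getValue());
      } catch (NumberFormatException e) {
        return false; // malformed snapshot ID: treat as stale
      }
      if (current == null || current != recorded) {
        return false; // base table missing or has moved on
      }
    }
    // Assumption: no recorded base snapshots means freshness cannot be shown.
    return sawBaseTable;
  }
}
```

Note that the check needs the current snapshot ID of every base table, so it requires one metadata load per base table regardless of where the storage table pointer lives.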
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <[email protected]> wrote:
>>>>>>>
>>>>>>>> > All of these approaches are aligned in one, specific way: the
>>>>>>>> storage table is an iceberg table.
>>>>>>>>
>>>>>>>> I do not think that is true. I think people are aligned that we
>>>>>>>> would like to re-use the Iceberg table metadata defined in the Iceberg
>>>>>>>> table spec to express the data in MV, but I don't think it goes so far
>>>>>>>> as to say it must be an Iceberg table. Once you have that mindset, then
>>>>>>>> of course option 1 (separate table and view) is the only option.
>>>>>>>>
>>>>>>>> > I don't think that is necessary and it significantly increases
>>>>>>>> the complexity.
>>>>>>>>
>>>>>>>> And can you quantify what you mean by "significantly increases the
>>>>>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff
>>>>>>>> with complexity. We probably all agree that using option 7 (a
>>>>>>>> completely new metadata type) is a lot of work from scratch, which is
>>>>>>>> why it is not favored. However, my understanding is that as long as we
>>>>>>>> re-use the view and table metadata, then the majority of the existing
>>>>>>>> logic can be reused. I think what we have gone through in Slack to
>>>>>>>> draft the rough Java API shape helps here, because people can estimate
>>>>>>>> the amount of effort required to implement it. And I don't think they
>>>>>>>> are **significantly** more complex to implement. Could you elaborate
>>>>>>>> more about the complexity that you imagine?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I feel I've been most vocal about pushing back against options 2+
>>>>>>>>> (or Ryan's categories of combined table/view, or new metadata type),
>>>>>>>>> so I'll try to expand on my reasoning.
>>>>>>>>>
>>>>>>>>> I understand the appeal of creating a design where we encapsulate
>>>>>>>>> the view/storage from both a structural and performance standpoint,
>>>>>>>>> but I don't think that is necessary and it significantly increases the
>>>>>>>>> complexity.
>>>>>>>>>
>>>>>>>>> All of these approaches are aligned in one, specific way: the
>>>>>>>>> storage table is an iceberg table.
>>>>>>>>>
>>>>>>>>> Because of this, all the behaviors and requirements still apply to
>>>>>>>>> these tables.  They need to be maintained (snapshot cleanup, orphan
>>>>>>>>> files), in some cases need to be optimized (compaction, manifest
>>>>>>>>> rewrites), and they need to be inspectable (this will be even more
>>>>>>>>> important with MV since staleness can produce different results and
>>>>>>>>> questions will arise about what state the storage table was in).
>>>>>>>>> There may be cases where the tables need to be managed directly.
>>>>>>>>>
>>>>>>>>> Anywhere we deviate from the existing constructs/commit/access for
>>>>>>>>> tables, we will ultimately have to then unwrap to re-expose the
>>>>>>>>> underlying Iceberg behavior.  This creates unnecessary complexity in
>>>>>>>>> the library/API layer, which is not the primary interface users will
>>>>>>>>> have with materialized views, where an engine is almost entirely
>>>>>>>>> necessary to interact with the dataset.
>>>>>>>>>
>>>>>>>>> As to the performance concerns around option 1, I think we're
>>>>>>>>> overstating the downsides.  It really comes down to how many metadata
>>>>>>>>> loads are necessary, and evaluating freshness would likely be the real
>>>>>>>>> bottleneck as it involves potentially loading many tables.  All of the
>>>>>>>>> options are on the same order of performance for the metadata and
>>>>>>>>> table loads.
>>>>>>>>>
>>>>>>>>> As to the visibility of tables and whether they're registered in
>>>>>>>>> the catalog, I think registering in the catalog is the right approach
>>>>>>>>> so that the tables are still addressable for maintenance/etc.  The
>>>>>>>>> visibility of the storage table is a catalog implementation decision
>>>>>>>>> and shouldn't be a requirement of the MV spec (I can see cases for
>>>>>>>>> both and it isn't necessary to dictate a behavior).
>>>>>>>>>
>>>>>>>>> I'm still strongly in favor of Option 1 (separate table and view)
>>>>>>>>> for these reasons.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>> and view (rather than a new metadata spec for a materialized view).
>>>>>>>>>> What is the main motivation? It seems like you’re convinced of that
>>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>>
>>>>>>>>>> Sorry, I had to make a Google Sheet to capture all the options we
>>>>>>>>>> have discussed so far. I wanted to use the existing Google Doc, but
>>>>>>>>>> it has really bad table/sheet support...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>>>>>
>>>>>>>>>> I have listed all the options, with how they are implemented and
>>>>>>>>>> some important considerations we have discussed so far. Note that:
>>>>>>>>>> 1. This sheet currently excludes the lineage information, which
>>>>>>>>>> we can discuss more later after the current topic is resolved.
>>>>>>>>>> 2. I removed the considerations for REST integration since from
>>>>>>>>>> the other thread we have clarified that they should be considered
>>>>>>>>>> completely separately.
>>>>>>>>>>
>>>>>>>>>> *Why I come as a proponent of having a new MV object with table
>>>>>>>>>> and view metadata file pointer*
>>>>>>>>>>
>>>>>>>>>> In my sheet, there are 3 options that do not have major problems:
>>>>>>>>>> Option 2: Add storage table metadata file pointer in view object
>>>>>>>>>> Option 5: New MV object with table and view metadata file pointer
>>>>>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>>>>>
>>>>>>>>>> I originally excluded option 2 because I think it does not align
>>>>>>>>>> with the REST spec, but after the other discussion thread about
>>>>>>>>>> "Inconsistency between REST spec and table/view spec", I think my
>>>>>>>>>> original concern no longer holds true so now I put it back. And based
>>>>>>>>>> on my personal preference that MV is an independent object that
>>>>>>>>>> should be separated from view and table, plus the fact that option 5
>>>>>>>>>> is probably less work than option 6 for implementation, that is how I
>>>>>>>>>> come as a proponent of option 5 at this moment.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Regarding Ryan's evaluation framework *
>>>>>>>>>>
>>>>>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6
>>>>>>>>>> all under the same category of "A combination of a view and a table"
>>>>>>>>>> and concludes that they don't have any advantage for the same set of
>>>>>>>>>> reasons. But those reasons are not really convincing to me so let's
>>>>>>>>>> talk about them in more detail.
>>>>>>>>>>
>>>>>>>>>> (1) You said "I don’t see a reason why a combined view and table
>>>>>>>>>> is advantageous" as "this would cause unnecessary dependence between
>>>>>>>>>> the view and table in catalogs."  What dependency exactly do you mean
>>>>>>>>>> here? And why is that unnecessary, given there has to be some sort of
>>>>>>>>>> dependency anyway unless we go with option 5 or 6?
>>>>>>>>>>
>>>>>>>>>> (2) You said "I guess there’s an argument that you could load
>>>>>>>>>> both table and view metadata locations at the same time. That hardly
>>>>>>>>>> seems worth the trouble". I disagree with that. Catalog interaction
>>>>>>>>>> performance is critical to at least everyone working in EMR and
>>>>>>>>>> Athena, and MV itself as an acceleration approach needs to be as fast
>>>>>>>>>> as possible.
>>>>>>>>>>
>>>>>>>>>> I have put 3 key operations in the doc that I think matter for
>>>>>>>>>> MV during interactions with the engine:
>>>>>>>>>> 1. refresh the storage table
>>>>>>>>>> 2. get the storage table of the MV
>>>>>>>>>> 3. if stale, get the view SQL
>>>>>>>>>>
>>>>>>>>>> And option 1 clearly falls short with 4 sequential steps required
>>>>>>>>>> to load a storage table. You mentioned "recent issues with adding
>>>>>>>>>> views to the JDBC catalog" in this topic. Could you explain a bit
>>>>>>>>>> more?
>>>>>>>>>>
>>>>>>>>>> (3) You said "I also think that once we decide on structure, we
>>>>>>>>>> can make it possible for REST catalog implementations to do smart
>>>>>>>>>> things, in a way that doesn’t put additional requirements on the
>>>>>>>>>> underlying catalog store." If REST is fully compatible with Iceberg
>>>>>>>>>> spec then I have no problem with this statement. However, as we
>>>>>>>>>> discussed in the other thread, it is not the case. In the current
>>>>>>>>>> state, I think the sequence of actions should be to evolve the
>>>>>>>>>> Iceberg table/view spec (or add a MV spec) first, and then think
>>>>>>>>>> about how REST can incorporate it or do smart things that are not
>>>>>>>>>> Iceberg spec compliant. Do you agree with that?
>>>>>>>>>>
>>>>>>>>>> (4) You said the table identifier pointer "is a problem we need
>>>>>>>>>> to solve generally because a materialized table needs to be able to
>>>>>>>>>> track the upstream state of tables that were used". I don't think
>>>>>>>>>> that is a reason to choose to use a table identifier pointer for a
>>>>>>>>>> storage table. The issue is not about using a table identifier
>>>>>>>>>> pointer. It is about exposing the storage table as a separate entity
>>>>>>>>>> in the catalog, which is what people do not like and is already
>>>>>>>>>> discussed at length in Jan's question 3 (also linked in the sheet). I
>>>>>>>>>> agree with that statement, because without a REST implementation that
>>>>>>>>>> can magically hide the storage table, this model adds additional
>>>>>>>>>> burden regarding compliance and data governance for any other
>>>>>>>>>> non-REST catalog implementations that are compliant with the Iceberg
>>>>>>>>>> spec. Many mechanisms need to be built in a catalog to hide, protect,
>>>>>>>>>> maintain, and recycle the storage table, which can be avoided by
>>>>>>>>>> using other approaches. I think we should reach a consensus about
>>>>>>>>>> that and discuss further if you do not agree.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>>>>>>>> where your categories correspond to the following designs:
>>>>>>>>>>>
>>>>>>>>>>>    - Separate table and view => Design 1
>>>>>>>>>>>    - Combination of view and table => Design 2
>>>>>>>>>>>    - A new metadata type => Design 4
>>>>>>>>>>>
>>>>>>>>>>> Jan
>>>>>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>>>>>
>>>>>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so
>>>>>>>>>>> I’ll be more specific:
>>>>>>>>>>>
>>>>>>>>>>>    - *Separate table and view*: this option is to have the
>>>>>>>>>>>    objects that we have today, with extra metadata. Commit
>>>>>>>>>>>    processes are separate: committing to the table doesn’t alter
>>>>>>>>>>>    the view and committing to the view doesn’t change the table.
>>>>>>>>>>>    However, changing the view can make it so the table is no
>>>>>>>>>>>    longer useful as a materialization.
>>>>>>>>>>>    - *A combination of a view and a table*: in this option, the
>>>>>>>>>>>    table metadata and view metadata are the same as the first
>>>>>>>>>>>    option. The difference is that the commit process combines
>>>>>>>>>>>    them, either by embedding a table metadata location in view
>>>>>>>>>>>    metadata or by tracking both in the same catalog reference.
>>>>>>>>>>>    - *A new metadata type*: this option is where we define a
>>>>>>>>>>>    new metadata object that has view attributes, like SQL
>>>>>>>>>>>    representations, along with table attributes, like partition
>>>>>>>>>>>    specs and snapshots.
>>>>>>>>>>>
>>>>>>>>>>> Hopefully this is clear because I think much of the confusion is
>>>>>>>>>>> caused by different definitions.
>>>>>>>>>>>
>>>>>>>>>>> > The LoadTableResponse having optional metadata-location field
>>>>>>>>>>> > implies that the object in the catalog no longer needs to hold a
>>>>>>>>>>> > metadata file pointer
>>>>>>>>>>>
>>>>>>>>>>> The REST protocol has not removed the requirement for a metadata
>>>>>>>>>>> file, so I’m going to keep focused on the MV design options.
>>>>>>>>>>>
>>>>>>>>>>> > When we say a MV can be a “new metadata type”, it does not mean
>>>>>>>>>>> > it needs to define a completely brand new structure of the
>>>>>>>>>>> > metadata content
>>>>>>>>>>>
>>>>>>>>>>> I’m making a distinction between separate metadata files for the
>>>>>>>>>>> table and the view and a combined metadata object, as above.
>>>>>>>>>>>
>>>>>>>>>>> > We can define an “Iceberg MV” to be an object in a catalog,
>>>>>>>>>>> > which has 1 table metadata file pointer, and 1 view metadata
>>>>>>>>>>> > file pointer
>>>>>>>>>>>
>>>>>>>>>>> This is the option I am referring to as a “combination of a view
>>>>>>>>>>> and a table”.
>>>>>>>>>>>
>>>>>>>>>>> So to review my initial email, I don’t see a reason why a
>>>>>>>>>>> combined view and table is advantageous, either implemented by
>>>>>>>>>>> having a catalog reference with two metadata locations or embedding
>>>>>>>>>>> a table metadata location in view metadata. This would cause
>>>>>>>>>>> unnecessary dependence between the view and table in catalogs. I
>>>>>>>>>>> guess there’s an argument that you could load both table and view
>>>>>>>>>>> metadata locations at the same time. That hardly seems worth the
>>>>>>>>>>> trouble given the recent issues with adding views to the JDBC
>>>>>>>>>>> catalog.
>>>>>>>>>>>
>>>>>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>>>>>> possible for REST catalog implementations to do smart things, in a
>>>>>>>>>>> way that doesn’t put additional requirements on the underlying
>>>>>>>>>>> catalog store. For instance, we could specify how to send
>>>>>>>>>>> additional objects in a LoadViewResult, in case the catalog wants
>>>>>>>>>>> to pre-fetch table metadata. I think these optimizations are a
>>>>>>>>>>> later addition, after we define the relationship between views and
>>>>>>>>>>> tables.
>>>>>>>>>>>
>>>>>>>>>>> Jack, it sounds like you’re the proponent of a combined table
>>>>>>>>>>> and view (rather than a new metadata spec for a materialized view).
>>>>>>>>>>> What is the main motivation? It seems like you’re convinced of that
>>>>>>>>>>> approach, but I don’t understand the advantage it brings.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few minor
>>>>>>>>>>>> points.
>>>>>>>>>>>>
>>>>>>>>>>>>> is a materialized view a view and a separate table, a
>>>>>>>>>>>>> combination of the two (i.e. commits are combined), or a new
>>>>>>>>>>>>> metadata type?
>>>>>>>>>>>>
>>>>>>>>>>>> For 'new metadata type', I consider mostly Jack's initial
>>>>>>>>>>>> proposal of a new Catalog MV object that has two references
>>>>>>>>>>>> (ViewMetadata + TableMetadata).
>>>>>>>>>>>>
>>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>>> object are:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Regular views are separate, rather than being tables
>>>>>>>>>>>>>    with SQL and no data so it would be inconsistent (“Iceberg
>>>>>>>>>>>>>    view is just a table with no data but with representations
>>>>>>>>>>>>>    defined. But we did not do that.”)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>>>>
>>>>>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>>>>>
>>>>>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says
>>>>>>>>>>>>    it is a spec change (ie, to catalogs)
>>>>>>>>>>>>    - A single call to get the View's StorageTable (versus two
>>>>>>>>>>>>    calls)
>>>>>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Thoughts:* I think the long discussion sessions we had on
>>>>>>>>>>>> Slack were fruitful for me, as seeing the API clarified some things.
>>>>>>>>>>>>
>>>>>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV
>>>>>>>>>>>> operations end up being ViewCatalog or Catalog operations, I am
>>>>>>>>>>>> starting to think API-wise that it may not align with the new
>>>>>>>>>>>> metadata type (unless we define MVCatalog and /MV REST endpoints,
>>>>>>>>>>>> which then are boilerplate wrappers).
>>>>>>>>>>>>
>>>>>>>>>>>> Initially one question I had for option 'a view and a separate
>>>>>>>>>>>> table', was how to make this table reference (metadata.json or
>>>>>>>>>>>> catalog reference).  In the previous option, we had a precedent of
>>>>>>>>>>>> Catalog references to Metadata, but not pointers between
>>>>>>>>>>>> Metadatas.  I initially saw the proposed Catalog's TableIdentifier
>>>>>>>>>>>> pointer as 'polluting' catalog concerns in ViewMetadata.  (I saw
>>>>>>>>>>>> Catalog and ViewCatalog as a layer above TableMetadata and
>>>>>>>>>>>> ViewMetadata).  But I think Dan in the Slack made a fair point
>>>>>>>>>>>> that ViewMetadata already is tightly bound with a Catalog.  In
>>>>>>>>>>>> this case, I think this approach does have its merits as well in
>>>>>>>>>>>> aligning Catalog API's with the metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to provide my perspective on the question of what
>>>>>>>>>>>>> a materialized view is and elaborate on Jack's recent proposal to
>>>>>>>>>>>>> view a materialized view as a catalog concept.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity
>>>>>>>>>>>>> in the catalog has a *unique identifier*, and the catalog
>>>>>>>>>>>>> provides methods to create, load, and update these entities. An
>>>>>>>>>>>>> important thing to note is that the catalog methods exhibit two
>>>>>>>>>>>>> different behaviors: the *create and load methods deal with the
>>>>>>>>>>>>> entire entity*, while the *update(commit) method only deals with
>>>>>>>>>>>>> partial changes* to the entities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the context of our current discussion, materialized view
>>>>>>>>>>>>> (MV) metadata is a union of view and table metadata. The fact
>>>>>>>>>>>>> that the update method deals only with partial changes enables us
>>>>>>>>>>>>> to *reuse the existing methods for updating tables and views*.
>>>>>>>>>>>>> For updates we don't have to define what constitutes an entire
>>>>>>>>>>>>> materialized view. Changes to a materialized view targeting the
>>>>>>>>>>>>> properties related to the view metadata could use the
>>>>>>>>>>>>> update(commit) view method. Similarly, changes targeting the
>>>>>>>>>>>>> properties related to the table metadata could use the
>>>>>>>>>>>>> update(commit) table method. This is great news because we don't
>>>>>>>>>>>>> have to redefine view and table commits (requirements, updates).
>>>>>>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>>>>>>> update the storage table for Option 1 and 3:
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>
>>>>>>>>>>>>> The open question is *whether the create and load methods
>>>>>>>>>>>>> should treat the properties that constitute the MV metadata as
>>>>>>>>>>>>> two entities (View + Table) or one entity (new MV object)*. This
>>>>>>>>>>>>> is all part of Jack's proposal, where Option 1 proposes a new MV
>>>>>>>>>>>>> object, and Option 3 proposes two separate entities. The
>>>>>>>>>>>>> advantage of Option 1 is that it doesn't require two operations
>>>>>>>>>>>>> to load the metadata. On the other hand, the advantage of Option
>>>>>>>>>>>>> 3 is that no new operations or catalogs have to be defined.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I
>>>>>>>>>>>>> see a path where we could first introduce Option 3 and still have
>>>>>>>>>>>>> the possibility to transition to Option 1 if needed. The great
>>>>>>>>>>>>> thing about Option 3 is that it only requires minor changes to
>>>>>>>>>>>>> the current spec and is mostly implementation detail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore I would propose small additions to Jack's Option 3
>>>>>>>>>>>>> that only introduce changes to the spec that are not specific to
>>>>>>>>>>>>> materialized views. The idea is to introduce boolean properties
>>>>>>>>>>>>> to be set on the creation of the view and the storage table that
>>>>>>>>>>>>> indicate that they belong to a materialized view. The view
>>>>>>>>>>>>> property "materialized" is set to "true" for a MV and "false" for
>>>>>>>>>>>>> a regular view. And the table property "storage_table" is set to
>>>>>>>>>>>>> "true" for a storage table and "false" for a regular table. The
>>>>>>>>>>>>> absence of these properties indicates a regular view or table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>>>>>> // (storageTable() is a proposed accessor, not an existing View method)
>>>>>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>>>>>
>>>>>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>>>>>
>>>>>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>>>>>> called "AssertProperty", which would ensure that only updates that
>>>>>>>>>>>>> are in line with materialized views are performed. The additional
>>>>>>>>>>>>> requirement can be seen as a general extension which does not need
>>>>>>>>>>>>> to be changed if we decide to go with Option 1 in the future.
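
One way such a requirement could behave, as a rough sketch only: the proposal names "AssertProperty" but does not define its shape, so the class below, its fields, and the choice to treat an absent property as "false" are all assumptions made for illustration. It does not use or mirror any existing Iceberg class.

```java
import java.util.Map;

public class AssertPropertySketch {
    // Hypothetical requirement: the commit is rejected unless the named
    // property currently has the expected value.
    static final class AssertProperty {
        private final String key;
        private final String expected;

        AssertProperty(String key, String expected) {
            this.key = key;
            this.expected = expected;
        }

        // Assumption: an absent property is read as "false", following the
        // rule that absence indicates a regular view or table.
        void validate(Map<String, String> currentProperties) {
            String actual = currentProperties.getOrDefault(key, "false");
            if (!expected.equals(actual)) {
                throw new IllegalStateException(
                    "Requirement failed: property '" + key + "' is '" + actual
                        + "', expected '" + expected + "'");
            }
        }
    }

    public static void main(String[] args) {
        AssertProperty requirement = new AssertProperty("storage_table", "true");

        // Passes: the target is marked as a storage table.
        requirement.validate(Map.of("storage_table", "true"));

        // Fails: a regular table has no such property.
        try {
            requirement.validate(Map.of());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the check is keyed on an arbitrary property name, nothing in it is specific to materialized views, which is the point of making it a general extension.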
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know what you think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing metadata
>>>>>>>>>>>>> definitions and minimizing spec changes are very important. This
>>>>>>>>>>>>> also minimizes spec drift (between materialized views and views
>>>>>>>>>>>>> spec, and between materialized views and tables spec), and
>>>>>>>>>>>>> simplifies the implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In an effort to take the discussion forward with concrete design
>>>>>>>>>>>>> options based on an end-to-end implementation, I have prototyped the
>>>>>>>>>>>>> implementation (and added Spark support) in this PR:
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us
>>>>>>>>>>>>> reach convergence faster. More details about some of the design
>>>>>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I mean separate table and view metadata that is somehow combined
>>>>>>>>>>>>>> through a commit process. For instance, keeping a pointer to a
>>>>>>>>>>>>>> table metadata file in a view metadata file or combining commits
>>>>>>>>>>>>>> to reference both. I don't see the value in either option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question!
>>>>>>>>>>>>>>> Just a clarification question regarding your reply before I
>>>>>>>>>>>>>>> reply further: what exactly does the option "a combination of
>>>>>>>>>>>>>>> the two (i.e. commits are combined)" mean? How is that different
>>>>>>>>>>>>>>> from "a new metadata type"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring
>>>>>>>>>>>>>>>> a fresh perspective.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jack already pointed out that we need to start from the basics
>>>>>>>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right
>>>>>>>>>>>>>>>> now is the time for discussing trade-offs, not lining up and
>>>>>>>>>>>>>>>> taking sides. I realize that wasn’t the intent with adding a
>>>>>>>>>>>>>>>> vote, but that’s almost always the result. It’s too easy to use
>>>>>>>>>>>>>>>> it as a stand-in for consensus and move on prematurely. I get
>>>>>>>>>>>>>>>> the impression from the swirl in Slack that discussion has
>>>>>>>>>>>>>>>> moved ahead of agreement.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We’re still at the most basic question: is a materialized view
>>>>>>>>>>>>>>>> a view and a separate table, a combination of the two (i.e.
>>>>>>>>>>>>>>>> commits are combined), or a new metadata type?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind
>>>>>>>>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the
>>>>>>>>>>>>>>>> catalog. That’s a later choice (already pointed out) and, I
>>>>>>>>>>>>>>>> suspect, it should be delegated to catalog implementations.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To simplify this a little, I think that we can eliminate the
>>>>>>>>>>>>>>>> option to combine table and view commits. I don’t think there
>>>>>>>>>>>>>>>> is a reason to combine the two. If separate, a table would
>>>>>>>>>>>>>>>> track the view version used along with freshness information
>>>>>>>>>>>>>>>> for referenced tables. If the table is automatically skipped
>>>>>>>>>>>>>>>> when the version no longer matches the view, then no action
>>>>>>>>>>>>>>>> needs to happen when a view definition changes. Similarly,
>>>>>>>>>>>>>>>> the table can be updated independentl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
