Re: Materialized view integration with REST spec

Walaa Eldin Moustafa Fri, 01 Mar 2024 17:08:22 -0800

The calendar on the site is currently broken:
https://iceberg.apache.org/community/#iceberg-community-events. It might help
to fix it or share the meeting link here.


On Fri, Mar 1, 2024 at 3:43 PM Jack Ye <yezhao...@gmail.com> wrote:

> Sounds good, let's discuss this in person!
>
> I am a bit worried that we have quite a few critical topics going on right
> now on devlist, and this will take up a lot of time to discuss. If it ends
> up going for too long, I propose we have a dedicated meeting, and I am
> more than happy to organize it.
>
> Best,
> Jack Ye
>
> On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Hey everyone,
>>
>> I think this thread has hit a point of diminishing returns and that we
>> still don't have a common understanding of what the options under
>> consideration actually are.
>>
>> Since we were already planning on discussing this at the next community
>> sync, I suggest we pick this up there and use that time to align on what
>> exactly we're considering. We can then start a new thread to lay out the
>> designs under consideration in more detail and then have a discussion about
>> trade-offs.
>>
>> Does that sound reasonable?
>>
>> Ryan
>>
>>
>> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> I am finding it hard to interpret the options concretely. I would also
>>> suggest breaking the expectation/outcome into milestones. Maybe it becomes
>>> easier if we agree to distinguish between an approach that is feasible in
>>> the near term and another in the long term, especially if the latter
>>> requires significant engine-side changes.
>>>
>>> Further, maybe it helps if we start with an option that fully reuses the
>>> existing spec, and see how we view it in comparison with the options
>>> discussed previously. I am sharing one below. It reuses the current spec of
>>> Iceberg views and tables by leveraging table properties to capture
>>> materialized view metadata. What is common (and not common) between this
>>> and the desired representations?
>>>
>>> The new properties are:
>>> Properties on a View:
>>>
>>>    1. *iceberg.materialized.view*:
>>>       - *Type*: View property
>>>       - *Purpose*: This property is used to mark whether a view is a
>>>       materialized view. If set to true, the view is treated as a
>>>       materialized view. This helps in differentiating between virtual and
>>>       materialized views within the catalog and dictates specific handling
>>>       and validation logic for materialized views.
>>>    2. *iceberg.materialized.view.storage.location*:
>>>       - *Type*: View property
>>>       - *Purpose*: Specifies the location of the storage table
>>>       associated with the materialized view. This property is used for
>>>       linking a materialized view with its corresponding storage table,
>>>       enabling data management and query execution based on the stored
>>>       data freshness.
>>>
>>> Properties on a Table:
>>>
>>>    1. *base.snapshot.[UUID]*:
>>>       - *Type*: Table property
>>>       - *Purpose*: These properties store the snapshot IDs of the base
>>>       tables at the time the materialized view's data was last updated. Each
>>>       property is prefixed with base.snapshot. followed by the UUID of
>>>       the base table. They are used to track whether the materialized
>>>       view's data is up to date with the base tables by comparing these
>>>       snapshot IDs with the current snapshot IDs of the base tables. If all
>>>       the base tables' current snapshot IDs match the ones stored in these
>>>       properties, the materialized view's data is considered fresh.
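
The freshness rule described for the base.snapshot.[UUID] properties can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of the Iceberg API, the caller is assumed to have already looked up each base table's current snapshot ID, and treating a storage table with no recorded base snapshots as stale is a choice made here, not something the proposal specifies.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Iceberg API.
final class MvFreshnessSketch {
    // Property prefix taken from the proposal in this message.
    static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

    private MvFreshnessSketch() {}

    /**
     * @param storageTableProps  properties of the storage table
     * @param currentSnapshotIds base table UUID -> current snapshot ID
     *                           (assumed to be resolved by the caller)
     * @return true only if every recorded base snapshot matches the current one
     */
    static boolean isFresh(Map<String, String> storageTableProps,
                           Map<String, Long> currentSnapshotIds) {
        int tracked = 0;
        for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
            if (!entry.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
                continue; // unrelated table property
            }
            tracked++;
            String baseTableUuid = entry.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
            Long current = currentSnapshotIds.get(baseTableUuid);
            if (current == null || !String.valueOf(current).equals(entry.getValue())) {
                return false; // base table unknown, or it has moved to another snapshot
            }
        }
        // Assumption: with no recorded base snapshots there is nothing to trust.
        return tracked > 0;
    }
}
```

A caller would pass the storage table's property map together with a map built from the current state of each base table, and fall back to the view SQL when the result is false.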
>>>
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> > All of these approaches are aligned in one, specific way: the storage
>>>> table is an iceberg table.
>>>>
>>>> I do not think that is true. I think people are aligned that we would
>>>> like to re-use the Iceberg table metadata defined in the Iceberg table spec
>>>> to express the data in MV, but I don't think it goes that far to say it
>>>> must be an Iceberg table. Once you have that mindset, then of course option
>>>> 1 (separate table and view) is the only option.
>>>>
>>>> > I don't think that is necessary and it significantly increases the
>>>> complexity.
>>>>
>>>> And can you quantify what you mean by "significantly increases the
>>>> complexity"? Seems like a lot of concerns are coming from the tradeoff with
>>>> complexity. We probably all agree that using option 7 (a completely new
>>>> metadata type) is a lot of work from scratch, that is why it is not
>>>> favored. However, my understanding is that as long as we re-use the view
>>>> and table metadata, then the majority of the existing logic can be reused.
>>>> I think what we have gone through in Slack to draft the rough Java API
>>>> shape helps here, because people can estimate the amount of effort required
>>>> to implement it. And I don't think they are **significantly** more complex
>>>> to implement. Could you elaborate more about the complexity that you
>>>> imagine?
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com>
>>>> wrote:
>>>>
>>>>> I feel I've been most vocal about pushing back against options 2+ (or
>>>>> Ryan's categories of combined table/view, or new metadata type), so I'll
>>>>> try to expand on my reasoning.
>>>>>
>>>>> I understand the appeal of creating a design where we encapsulate the
>>>>> view/storage from both a structural and performance standpoint, but I
>>>>> don't think that is necessary and it significantly increases the
>>>>> complexity.
>>>>>
>>>>> All of these approaches are aligned in one, specific way: the storage
>>>>> table is an iceberg table.
>>>>>
>>>>> Because of this, all the behaviors and requirements still apply to
>>>>> these tables.  They need to be maintained (snapshot cleanup, orphan
>>>>> files), in some cases need to be optimized (compaction, manifest
>>>>> rewrites), and they need to be inspectable (this will be even more
>>>>> important with MV since staleness can produce different results and
>>>>> questions will arise about what state the storage table was in).  There
>>>>> may be cases where the tables need to be managed directly.
>>>>>
>>>>> Anywhere we deviate from the existing constructs/commit/access for
>>>>> tables, we will ultimately have to then unwrap to re-expose the underlying
>>>>> Iceberg behavior.  This creates unnecessary complexity in the library/API
>>>>> layer, which is not the primary interface users will have with
>>>>> materialized views, where an engine is almost always required to
>>>>> interact with the dataset.
>>>>>
>>>>> As to the performance concerns around option 1, I think we're
>>>>> overstating the downsides.  It really comes down to how many metadata
>>>>> loads are necessary, and evaluating freshness would likely be the real
>>>>> bottleneck as it involves potentially loading many tables.  All of the
>>>>> options are on the same order of performance for the metadata and table
>>>>> loads.
>>>>>
>>>>> As to the visibility of tables and whether they're registered in the
>>>>> catalog, I think registering in the catalog is the right approach so that
>>>>> the tables are still addressable for maintenance/etc.  The visibility of
>>>>> the storage table is a catalog implementation decision and shouldn't be a
>>>>> requirement of the MV spec (I can see cases for both and it isn't
>>>>> necessary to dictate a behavior).
>>>>>
>>>>> I'm still strongly in favor of Option 1 (separate table and view) for
>>>>> these reasons.
>>>>>
>>>>> -Dan
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> > Jack, it sounds like you’re the proponent of a combined table and
>>>>>> view (rather than a new metadata spec for a materialized view). What is
>>>>>> the main motivation? It seems like you’re convinced of that approach,
>>>>>> but I don’t understand the advantage it brings.
>>>>>>
>>>>>> Sorry, I had to make a Google Sheet to capture all the options we
>>>>>> have discussed so far. I wanted to use the existing Google Doc, but it
>>>>>> has really bad table/sheet support...
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>>
>>>>>> I have listed all the options, with how they are implemented and some
>>>>>> important considerations we have discussed so far. Note that:
>>>>>> 1. This sheet currently excludes the lineage information, which we
>>>>>> can discuss more later after the current topic is resolved.
>>>>>> 2. I removed the considerations for REST integration since from the
>>>>>> other thread we have clarified that they should be considered completely
>>>>>> separately.
>>>>>>
>>>>>> *Why I come as a proponent of having a new MV object with table and
>>>>>> view metadata file pointer*
>>>>>>
>>>>>> In my sheet, there are 3 options that do not have major problems:
>>>>>> Option 2: Add storage table metadata file pointer in view object
>>>>>> Option 5: New MV object with table and view metadata file pointer
>>>>>> Option 6: New MV spec with table and view metadata
>>>>>>
>>>>>> I originally excluded option 2 because I thought it did not align with
>>>>>> the REST spec, but after the other discussion thread about "Inconsistency
>>>>>> between REST spec and table/view spec", I think my original concern no
>>>>>> longer holds, so I have put it back. Based on my personal
>>>>>> preference that MV is an independent object that should be separated from
>>>>>> view and table, plus the fact that option 5 is probably less work to
>>>>>> implement than option 6, I am a proponent of option 5 at this moment.
>>>>>>
>>>>>>
>>>>>> *Regarding Ryan's evaluation framework*
>>>>>>
>>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6 all
>>>>>> under the same category of "A combination of a view and a table" and
>>>>>> concludes that they don't have any advantage for the same set of reasons.
>>>>>> But those reasons are not really convincing to me, so let's talk about
>>>>>> them in more detail.
>>>>>>
>>>>>> (1) You said "I don’t see a reason why a combined view and table is
>>>>>> advantageous" as "this would cause unnecessary dependence between the
>>>>>> view and table in catalogs."  What dependency exactly do you mean here?
>>>>>> And why is that unnecessary, given there has to be some sort of
>>>>>> dependency anyway unless we go with option 5 or 6?
>>>>>>
>>>>>> (2) You said "I guess there’s an argument that you could load both
>>>>>> table and view metadata locations at the same time. That hardly seems
>>>>>> worth the trouble". I disagree with that. Catalog interaction performance
>>>>>> is critical to at least everyone working in EMR and Athena, and MV itself,
>>>>>> as an acceleration approach, needs to be as fast as possible.
>>>>>>
>>>>>> I have put 3 key operations in the doc that I think matter for MV
>>>>>> during interactions with an engine:
>>>>>> 1. refresh the storage table
>>>>>> 2. get the storage table of the MV
>>>>>> 3. if stale, get the view SQL
>>>>>>
>>>>>> And option 1 clearly falls short with 4 sequential steps required to
>>>>>> load a storage table. You mentioned "recent issues with adding views to
>>>>>> the JDBC catalog" in this topic; could you explain a bit more?
>>>>>>
>>>>>> (3) You said "I also think that once we decide on structure, we can
>>>>>> make it possible for REST catalog implementations to do smart things, in
>>>>>> a way that doesn’t put additional requirements on the underlying catalog
>>>>>> store." If REST is fully compatible with the Iceberg spec then I have no
>>>>>> problem with this statement. However, as we discussed in the other
>>>>>> thread, that is not the case. In the current state, I think the sequence
>>>>>> of actions should be to evolve the Iceberg table/view spec (or add an MV
>>>>>> spec) first, and then think about how REST can incorporate it or do smart
>>>>>> things that are not Iceberg spec compliant. Do you agree with that?
>>>>>>
>>>>>> (4) You said the table identifier pointer "is a problem we need to
>>>>>> solve generally because a materialized table needs to be able to track
>>>>>> the upstream state of tables that were used". I don't think that is a
>>>>>> reason to choose a table identifier pointer for a storage table. The
>>>>>> issue is not about using a table identifier pointer. It is about exposing
>>>>>> the storage table as a separate entity in the catalog, which is what
>>>>>> people do not like and is already discussed at length in Jan's question 3
>>>>>> (also linked in the sheet). I agree with that statement, because without
>>>>>> a REST implementation that can magically hide the storage table, this
>>>>>> model adds additional burden regarding compliance and data governance for
>>>>>> any non-REST catalog implementations that are compliant with the Iceberg
>>>>>> spec. Many mechanisms need to be built in a catalog to hide, protect,
>>>>>> maintain, and recycle the storage table, all of which can be avoided by
>>>>>> using other approaches. I think we should reach a consensus about that
>>>>>> and discuss further if you do not agree.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <jank...@mailbox.org.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>>>> where your categories correspond to the following designs:
>>>>>>>
>>>>>>>    - Separate table and view => Design 1
>>>>>>>    - Combination of view and table => Design 2
>>>>>>>    - A new metadata type => Design 4
>>>>>>>
>>>>>>> Jan
>>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>>
>>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so
>>>>>>> I’ll be more specific:
>>>>>>>
>>>>>>>    - *Separate table and view*: this option is to have the objects
>>>>>>>    that we have today, with extra metadata. Commit processes are
>>>>>>>    separate: committing to the table doesn’t alter the view and
>>>>>>>    committing to the view doesn’t change the table. However, changing
>>>>>>>    the view can make it so the table is no longer useful as a
>>>>>>>    materialization.
>>>>>>>    - *A combination of a view and a table*: in this option, the
>>>>>>>    table metadata and view metadata are the same as the first option.
>>>>>>>    The difference is that the commit process combines them, either by
>>>>>>>    embedding a table metadata location in view metadata or by tracking
>>>>>>>    both in the same catalog reference.
>>>>>>>    - *A new metadata type*: this option is where we define a new
>>>>>>>    metadata object that has view attributes, like SQL representations,
>>>>>>>    along with table attributes, like partition specs and snapshots.
>>>>>>>
>>>>>>> Hopefully this is clear because I think much of the confusion is
>>>>>>> caused by different definitions.
>>>>>>>
>>>>>>> The LoadTableResponse having optional metadata-location field
>>>>>>> implies that the object in the catalog no longer needs to hold a
>>>>>>> metadata file pointer
>>>>>>>
>>>>>>> The REST protocol has not removed the requirement for a metadata
>>>>>>> file, so I’m going to keep focused on the MV design options.
>>>>>>>
>>>>>>> When we say a MV can be a “new metadata type”, it does not mean it
>>>>>>> needs to define a completely brand new structure of the metadata content
>>>>>>>
>>>>>>> I’m making a distinction between separate metadata files for the
>>>>>>> table and the view and a combined metadata object, as above.
>>>>>>>
>>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which
>>>>>>> has 1 table metadata file pointer, and 1 view metadata file pointer
>>>>>>>
>>>>>>> This is the option I am referring to as a “combination of a view and
>>>>>>> a table”.
>>>>>>>
>>>>>>> So to review my initial email, I don’t see a reason why a combined
>>>>>>> view and table is advantageous, either implemented by having a catalog
>>>>>>> reference with two metadata locations or embedding a table metadata
>>>>>>> location in view metadata. This would cause unnecessary dependence
>>>>>>> between the view and table in catalogs. I guess there’s an argument that
>>>>>>> you could load both table and view metadata locations at the same time.
>>>>>>> That hardly seems worth the trouble given the recent issues with adding
>>>>>>> views to the JDBC catalog.
>>>>>>>
>>>>>>> I also think that once we decide on structure, we can make it
>>>>>>> possible for REST catalog implementations to do smart things, in a way
>>>>>>> that doesn’t put additional requirements on the underlying catalog
>>>>>>> store. For instance, we could specify how to send additional objects in
>>>>>>> a LoadViewResult, in case the catalog wants to pre-fetch table metadata.
>>>>>>> I think these optimizations are a later addition, after we define the
>>>>>>> relationship between views and tables.
>>>>>>>
>>>>>>> Jack, it sounds like you’re the proponent of a combined table and
>>>>>>> view (rather than a new metadata spec for a materialized view). What is
>>>>>>> the main motivation? It seems like you’re convinced of that approach,
>>>>>>> but I don’t understand the advantage it brings.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Yes I mostly agree with the assessment.  To clarify a few minor
>>>>>>>> points.
>>>>>>>>
>>>>>>>> is a materialized view a view and a separate table, a combination
>>>>>>>>> of the two (i.e. commits are combined), or a new metadata type?
>>>>>>>>
>>>>>>>>
>>>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal
>>>>>>>> of a new Catalog MV object that has two references (ViewMetadata +
>>>>>>>> TableMetadata).
>>>>>>>>
>>>>>>>> The arguments that I see for a combined materialized view object
>>>>>>>>> are:
>>>>>>>>>
>>>>>>>>>    - Regular views are separate, rather than being tables with
>>>>>>>>>    SQL and no data so it would be inconsistent (“Iceberg view is just
>>>>>>>>>    a table with no data but with representations defined. But we did
>>>>>>>>>    not do that.”)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>    materialized views
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>
>>>>>>>> For completeness, there seem to be a few additional ones
>>>>>>>> (mentioned in the Slack and above messages).
>>>>>>>>
>>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says it
>>>>>>>>    is a spec change (ie, to catalogs)
>>>>>>>>    - A single call to get the View's StorageTable (versus two
>>>>>>>>    calls)
>>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>>
>>>>>>>>
>>>>>>>> *Thoughts:  *I think the long discussion sessions we had on Slack
>>>>>>>> were fruitful for me, as seeing the API clarified some things.
>>>>>>>>
>>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV operations
>>>>>>>> end up being ViewCatalog or Catalog operations, I am starting to think
>>>>>>>> API-wise that it may not align with the new metadata type (unless we
>>>>>>>> define MVCatalog and /MV REST endpoints, which then are boilerplate
>>>>>>>> wrappers).
>>>>>>>>
>>>>>>>> Initially one question I had for option 'a view and a separate
>>>>>>>> table' was how to make this table reference (metadata.json or catalog
>>>>>>>> reference).  In the previous option, we had a precedent of Catalog
>>>>>>>> references to Metadata, but not pointers between Metadatas.  I
>>>>>>>> initially saw the proposed Catalog's TableIdentifier pointer as
>>>>>>>> 'polluting' catalog concerns in ViewMetadata.  (I saw Catalog and
>>>>>>>> ViewCatalog as a layer above TableMetadata and ViewMetadata).  But I
>>>>>>>> think Dan in the Slack made a fair point that ViewMetadata already is
>>>>>>>> tightly bound with a Catalog.  In this case, I think this approach does
>>>>>>>> have its merits as well in aligning Catalog APIs with the metadata.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I would like to provide my perspective on the question of what a
>>>>>>>>> materialized view is and elaborate on Jack's recent proposal to view a
>>>>>>>>> materialized view as a catalog concept.
>>>>>>>>>
>>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in
>>>>>>>>> the catalog has a *unique identifier*, and the catalog provides
>>>>>>>>> methods to create, load, and update these entities. An important
>>>>>>>>> thing to note is that the catalog methods exhibit two different
>>>>>>>>> behaviors: the *create and load methods deal with the entire entity*,
>>>>>>>>> while the *update(commit) method only deals with partial changes* to
>>>>>>>>> the entities.
>>>>>>>>>
>>>>>>>>> In the context of our current discussion, materialized view (MV)
>>>>>>>>> metadata is a union of view and table metadata. The fact that the
>>>>>>>>> update method deals only with partial changes enables us to *reuse the
>>>>>>>>> existing methods for updating tables and views*. For updates we
>>>>>>>>> don't have to define what constitutes an entire materialized view.
>>>>>>>>> Changes to a materialized view targeting the properties related to the
>>>>>>>>> view metadata could use the update(commit) view method. Similarly,
>>>>>>>>> changes targeting the properties related to the table metadata could
>>>>>>>>> use the update(commit) table method. This is great news because we
>>>>>>>>> don't have to redefine view and table commits (requirements, updates).
>>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>>> update the storage table for Option 1 and 3:
>>>>>>>>>
>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>
>>>>>>>>> The open question is *whether the create and load methods should
>>>>>>>>> treat the properties that constitute the MV metadata as two entities
>>>>>>>>> (View + Table) or one entity (new MV object)*. This is all part of
>>>>>>>>> Jack's proposal, where Option 1 proposes a new MV object, and Option 3
>>>>>>>>> proposes two separate entities. The advantage of Option 1 is that it
>>>>>>>>> doesn't require two operations to load the metadata. On the other
>>>>>>>>> hand, the advantage of Option 3 is that no new operations or catalogs
>>>>>>>>> have to be defined.
>>>>>>>>>
>>>>>>>>> In my opinion, defining a new representation for materialized
>>>>>>>>> views (Option 1) is generally the cleaner solution. However, I see a
>>>>>>>>> path where we could first introduce Option 3 and still have the
>>>>>>>>> possibility to transition to Option 1 if needed. The great thing about
>>>>>>>>> Option 3 is that it only requires minor changes to the current spec and
>>>>>>>>> is mostly implementation detail.
>>>>>>>>>
>>>>>>>>> Therefore I would propose small additions to Jack's Option 3 that
>>>>>>>>> only introduce changes to the spec that are not specific to
>>>>>>>>> materialized views. The idea is to introduce boolean properties to be
>>>>>>>>> set on the creation of the view and the storage table that indicate
>>>>>>>>> that they belong to a materialized view. The view property
>>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>>> view. And the table property "storage_table" is set to "true" for a
>>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>>> properties indicates a regular view or table.
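
The marker-property rule in this paragraph (absent or "false" means a regular view or table) could be checked roughly as follows. The property names "materialized" and "storage_table" come from the message; the class and method names are hypothetical and exist only for this sketch, which works on plain property maps and does not touch the Iceberg API.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Iceberg API.
final class MvMarkerSketch {
    // Property names as proposed in this message.
    static final String MATERIALIZED = "materialized";
    static final String STORAGE_TABLE = "storage_table";

    private MvMarkerSketch() {}

    // Boolean.parseBoolean(null) is false, so an absent property
    // is treated the same as "false": a regular view.
    static boolean isMaterializedView(Map<String, String> viewProperties) {
        return Boolean.parseBoolean(viewProperties.get(MATERIALIZED));
    }

    // Same rule for tables: absent or "false" means a regular table.
    static boolean isStorageTable(Map<String, String> tableProperties) {
        return Boolean.parseBoolean(tableProperties.get(STORAGE_TABLE));
    }
}
```

An engine or catalog could consult these checks before deciding whether to look for a storage table at all.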
>>>>>>>>>
>>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>>
>>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>>
>>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>>
>>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>>
>>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>>> called "AssertProperty" which could make sure to only perform updates
>>>>>>>>> that are in line with materialized views. The additional requirement
>>>>>>>>> can be seen as a general extension which does not need to be changed if
>>>>>>>>> we decide to go with Option 1 in the future.
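
One possible shape for such an "AssertProperty" requirement is sketched below. Only the name "AssertProperty" comes from the message; the fields, the validate method, and the exception type are assumptions made for illustration, and the sketch does not reproduce how Iceberg's existing commit requirements are actually defined.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical shape of the proposed requirement; not an existing Iceberg class.
final class AssertPropertySketch {
    private final String key;
    private final String expectedValue;

    AssertPropertySketch(String key, String expectedValue) {
        this.key = Objects.requireNonNull(key, "key");
        this.expectedValue = expectedValue; // null means "property must be absent"
    }

    /** Rejects the commit when the current property value differs from the expected one. */
    void validate(Map<String, String> currentProperties) {
        String actual = currentProperties.get(key);
        if (!Objects.equals(expectedValue, actual)) {
            throw new IllegalStateException(
                "Requirement failed: property '" + key + "' expected '"
                    + expectedValue + "' but was '" + actual + "'");
        }
    }
}
```

For example, an update to a storage table could carry a requirement on "storage_table" = "true", so the commit fails if the target turns out to be a regular table.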
>>>>>>>>>
>>>>>>>>> Let me know what you think.
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>>
>>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>>> metadata definitions and minimizing spec changes are very important.
>>>>>>>>> This also minimizes spec drift (between materialized views and views
>>>>>>>>> spec, and between materialized views and tables spec), and simplifies
>>>>>>>>> the implementation.
>>>>>>>>>
>>>>>>>>> In an effort to take the discussion forward with concrete design
>>>>>>>>> options based on an end-to-end implementation, I have prototyped the
>>>>>>>>> implementation (and added Spark support) in this PR
>>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us
>>>>>>>>> reach convergence faster. More details about some of the design
>>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> I mean separate table and view metadata that is somehow combined
>>>>>>>>>> through a commit process. For instance, keeping a pointer to a table
>>>>>>>>>> metadata file in a view metadata file or combining commits to
>>>>>>>>>> reference both. I don't see the value in either option.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Ryan for the help to trace back to the root question!
>>>>>>>>>>> Just a clarification question regarding your reply before I reply
>>>>>>>>>>> further: what exactly does the option "a combination of the two (i.e.
>>>>>>>>>>> commits are combined)" mean? How is that different from "a new
>>>>>>>>>>> metadata type"?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring
>>>>>>>>>>>> a fresh perspective.
>>>>>>>>>>>>
>>>>>>>>>>>> Jack already pointed out that we need to start from the basics
>>>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right
>>>>>>>>>>>> now is the time for discussing trade-offs, not lining up and taking
>>>>>>>>>>>> sides. I realize that wasn’t the intent with adding a vote, but
>>>>>>>>>>>> that’s almost always the result. It’s too easy to use it as a
>>>>>>>>>>>> stand-in for consensus and move on prematurely. I get the impression
>>>>>>>>>>>> from the swirl in Slack that discussion has moved ahead of agreement.
>>>>>>>>>>>>
>>>>>>>>>>>> We’re still at the most basic question: is a materialized view
>>>>>>>>>>>> a view and a separate table, a combination of the two (i.e. commits
>>>>>>>>>>>> are combined), or a new metadata type?
>>>>>>>>>>>>
>>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind
>>>>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the
>>>>>>>>>>>> catalog. That’s a later choice (already pointed out) and, I suspect,
>>>>>>>>>>>> it should be delegated to catalog implementations.
>>>>>>>>>>>>
>>>>>>>>>>>> To simplify this a little, I think that we can eliminate the
>>>>>>>>>>>> option to combine table and view commits. I don’t think there is a
>>>>>>>>>>>> reason to combine the two. If separate, a table would track the view
>>>>>>>>>>>> version used along with freshness information for referenced tables.
>>>>>>>>>>>> If the table is automatically skipped when the version no longer
>>>>>>>>>>>> matches the view, then no action needs to happen when a view
>>>>>>>>>>>> definition changes. Similarly, the table can be updated independently
>>>>>>>>>>>> without needing to also swap view metadata. This also aligns with the
>>>>>>>>>>>> idea from the original doc that there can be multiple materialization
>>>>>>>>>>>> tables for a view. Each should operate independently unless I’m
>>>>>>>>>>>> missing something.
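
The "skip the table when its recorded view version no longer matches" idea can be illustrated with a small selection routine. Everything named here is an assumption for the sake of the sketch: the message does not say how a materialization table records the view version, so the input is simply a map from a storage table name to the view version ID it was built from, and base-table freshness is deliberately left out.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical selection logic for illustration; not part of the Iceberg API.
final class MaterializationChoiceSketch {
    private MaterializationChoiceSketch() {}

    /**
     * @param currentViewVersionId       version ID of the view as it is defined now
     * @param recordedViewVersionByTable storage table name -> view version ID that
     *                                   table was materialized from (assumed input)
     * @return a storage table built from the current view version, or empty when
     *         every candidate is skipped and the view SQL must be run instead
     */
    static Optional<String> pickStorageTable(int currentViewVersionId,
                                             Map<String, Integer> recordedViewVersionByTable) {
        for (Map.Entry<String, Integer> entry : recordedViewVersionByTable.entrySet()) {
            if (entry.getValue() == currentViewVersionId) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty(); // no table matches: fall back to the plain view
    }
}
```

Because each candidate is checked on its own, changing the view definition requires no action on any storage table, which matches the independence argued for in this paragraph.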
>>>>>>>>>>>>
>>>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious so
>>>>>>>>>>>> I’ll move on, but please stop here and reply if you disagree!
>>>>>>>>>>>>
>>>>>>>>>>>> That leaves the main two options, a view and a separate table
>>>>>>>>>>>> linked by metadata, or, combined materialized view metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> As the doc notes, the separate view and table option is simpler
>>>>>>>>>>>> because it reuses existing metadata definitions and falls back to
>>>>>>>>>>>> simple views. That is a significantly smaller spec and small is very,
>>>>>>>>>>>> very important when it comes to specs. I think that the argument for a
>>>>>>>>>>>> new definition of a materialized view needs to overcome this
>>>>>>>>>>>> disadvantage.
>>>>>>>>>>>>
>>>>>>>>>>>> The arguments that I see for a combined materialized view
>>>>>>>>>>>> object are:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Regular views are separate, rather than being tables with
>>>>>>>>>>>>    SQL and no data so it would be inconsistent (“Iceberg view is
>>>>>>>>>>>>    just a table with no data but with representations defined. But we
>>>>>>>>>>>>    did not do that.”)
>>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>>    materialized views
>>>>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>>>
>>>>>>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
