Hey everyone,

I think this thread has hit a point of diminishing returns and that we
still don't have a common understanding of what the options under
consideration actually are.

Since we were already planning on discussing this at the next community
sync, I suggest we pick this up there and use that time to align on what
exactly we're considering. We can then start a new thread to lay out the
designs under consideration in more detail and then have a discussion about
trade-offs.

Does that sound reasonable?

Ryan


On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> I am finding it hard to interpret the options concretely. I would also
> suggest breaking the expected outcomes into milestones. Maybe it becomes
> easier if we agree to distinguish between an approach that is feasible in
> the near term and another in the long term, especially if the latter
> requires significant engine-side changes.
>
> Further, maybe it helps if we start with an option that fully reuses the
> existing spec, and see how we view it in comparison with the options
> discussed previously. I am sharing one below. It reuses the current spec of
> Iceberg views and tables by leveraging table properties to capture
> materialized view metadata. What is common (and not common) between this
> and the desired representations?
>
> The new properties are:
> Properties on a View:
>
>    1. *iceberg.materialized.view*:
>       - *Type*: View property
>       - *Purpose*: This property is used to mark whether a view is a
>       materialized view. If set to true, the view is treated as a
>       materialized view. This helps in differentiating between virtual and
>       materialized views within the catalog and dictates specific handling
>       and validation logic for materialized views.
>    2. *iceberg.materialized.view.storage.location*:
>       - *Type*: View property
>       - *Purpose*: Specifies the location of the storage table associated
>       with the materialized view. This property is used for linking a
>       materialized view with its corresponding storage table, enabling data
>       management and query execution based on the stored data freshness.
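To make the proposal concrete, the properties section of a materialized view's metadata JSON might look like the following sketch (the storage location value is a hypothetical example, not part of the proposal):

```json
{
  "properties": {
    "iceberg.materialized.view": "true",
    "iceberg.materialized.view.storage.location": "s3://warehouse/db1/mv1/storage"
  }
}
```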
>
> Properties on a Table:
>
>    1. *base.snapshot.[UUID]*:
>       - *Type*: Table property
>       - *Purpose*: These properties store the snapshot IDs of the base
>       tables at the time the materialized view's data was last updated. Each
>       property is prefixed with base.snapshot. followed by the UUID of
>       the base table. They are used to track whether the materialized view's
>       data is up to date with the base tables by comparing these snapshot
>       IDs with the current snapshot IDs of the base tables. If all the base
>       tables' current snapshot IDs match the ones stored in these
>       properties, the materialized view's data is considered fresh.
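A minimal sketch of the freshness check this base.snapshot.[UUID] scheme implies, using plain maps rather than the real Iceberg API (the UUIDs and snapshot IDs here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class MvFreshnessCheck {
  private static final String PREFIX = "base.snapshot.";

  // Freshness rule from the proposal: the MV data is fresh only if, for every
  // base table, the current snapshot ID equals the snapshot ID recorded in
  // the storage table's base.snapshot.[UUID] property.
  public static boolean isFresh(Map<String, String> storageTableProps,
                                Map<String, Long> currentBaseSnapshots) {
    for (Map.Entry<String, Long> entry : currentBaseSnapshots.entrySet()) {
      String recorded = storageTableProps.get(PREFIX + entry.getKey());
      if (recorded == null || !recorded.equals(String.valueOf(entry.getValue()))) {
        return false; // missing or mismatched base snapshot -> stale
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put("base.snapshot.uuid-a", "101"); // hypothetical UUIDs and IDs
    props.put("base.snapshot.uuid-b", "202");

    Map<String, Long> current = new HashMap<>();
    current.put("uuid-a", 101L);
    current.put("uuid-b", 202L);
    System.out.println(isFresh(props, current)); // all snapshots match -> fresh

    current.put("uuid-b", 203L); // base table uuid-b has a new snapshot
    System.out.println(isFresh(props, current)); // stale
  }
}
```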
>
>
> Thanks,
> Walaa.
>
>
> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > All of these approaches are aligned in one, specific way: the storage
>> table is an iceberg table.
>>
>> I do not think that is true. I think people are aligned that we would
>> like to re-use the Iceberg table metadata defined in the Iceberg table spec
>> to express the data in MV, but I don't think it goes that far to say it
>> must be an Iceberg table. Once you have that mindset, then of course option
>> 1 (separate table and view) is the only option.
>>
>> > I don't think that is necessary and it significantly increases the
>> complexity.
>>
>> And can you quantify what you mean by "significantly increases the
>> complexity"? Seems like a lot of concerns are coming from the tradeoff with
>> complexity. We probably all agree that using option 7 (a completely new
>> metadata type) is a lot of work from scratch, that is why it is not
>> favored. However, my understanding is that as long as we re-use the view
>> and table metadata, then the majority of the existing logic can be reused.
>> I think what we have gone through in Slack to draft the rough Java API
>> shape helps here, because people can estimate the amount of effort required
>> to implement it. And I don't think they are **significantly** more complex
>> to implement. Could you elaborate more about the complexity that you
>> imagine?
>>
>> -Jack
>>
>>
>>
>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com>
>> wrote:
>>
>>> I feel I've been most vocal about pushing back against options 2+ (or
>>> Ryan's categories of combined table/view, or new metadata type), so I'll
>>> try to expand on my reasoning.
>>>
>>> I understand the appeal of creating a design where we encapsulate the
>>> view/storage from both a structural and performance standpoint, but I don't
>>> think that is necessary and it significantly increases the complexity.
>>>
>>> All of these approaches are aligned in one, specific way: the storage
>>> table is an iceberg table.
>>>
>>> Because of this, all the behaviors and requirements still apply to these
>>> tables.  They need to be maintained (snapshot cleanup, orphan files), in
>>> some cases need to be optimized (compaction, manifest rewrites), and need to be
>>> able to be inspected (this will be even more important with MV since
>>> staleness can produce different results and questions will arise about what
>>> state the storage table was in).  There may be cases where the tables need
>>> to be managed directly.
>>>
>>> Anywhere we deviate from the existing constructs/commit/access for
>>> tables, we will ultimately have to unwrap them to re-expose the underlying
>>> Iceberg behavior.  This creates unnecessary complexity in the library/API
>>> layer, which is not the primary interface users will have with
>>> materialized views anyway, since an engine is almost always necessary to
>>> interact with the dataset.
>>>
>>> As to the performance concerns around option 1, I think we're
>>> overstating the downsides.  It really comes down to how many metadata loads
>>> are necessary, and evaluating freshness would likely be the real bottleneck,
>>> as it involves potentially loading many tables.  All of the options are on
>>> the same order of performance for the metadata and table loads.
>>>
>>> As to the visibility of tables and whether they're registered in the
>>> catalog, I think registering in the catalog is the right approach so that
>>> the tables are still addressable for maintenance/etc.  The visibility of
>>> the storage table is a catalog implementation decision and shouldn't be a
>>> requirement of the MV spec (I can see cases for both and it isn't necessary
>>> to dictate a behavior).
>>>
>>> I'm still strongly in favor of Option 1 (separate table and view) for
>>> these reasons.
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> > Jack, it sounds like you’re the proponent of a combined table and
>>>> view (rather than a new metadata spec for a materialized view). What is the
>>>> main motivation? It seems like you’re convinced of that approach, but I
>>>> don’t understand the advantage it brings.
>>>>
>>>> Sorry, I had to make a Google Sheet to capture all the options we have
>>>> discussed so far. I wanted to use the existing Google Doc, but it has
>>>> really bad table/sheet support...
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>
>>>> I have listed all the options, with how they are implemented and some
>>>> important considerations we have discussed so far. Note that:
>>>> 1. This sheet currently excludes the lineage information, which we can
>>>> discuss more later after the current topic is resolved.
>>>> 2. I removed the considerations for REST integration since from the
>>>> other thread we have clarified that they should be considered completely
>>>> separately.
>>>>
>>>> *Why I came to be a proponent of having a new MV object with table and
>>>> view metadata file pointers*
>>>>
>>>> In my sheet, there are 3 options that do not have major problems:
>>>> Option 2: Add storage table metadata file pointer in view object
>>>> Option 5: New MV object with table and view metadata file pointer
>>>> Option 6: New MV spec with table and view metadata
>>>>
>>>> I originally excluded option 2 because I thought it did not align with
>>>> the REST spec, but after the other discussion thread about "Inconsistency
>>>> between REST spec and table/view spec", my original concern no
>>>> longer holds, so now I have put it back. Based on my personal
>>>> preference that MV is an independent object that should be separated from
>>>> view and table, plus the fact that option 5 is probably less work than
>>>> option 6 to implement, that is how I came to be a proponent of option 5
>>>> at this moment.
>>>>
>>>>
>>>> *Regarding Ryan's evaluation framework*
>>>>
>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>> framework. That framework puts options 2, 3, 4, 5, and 6 all
>>>> under the same category of "A combination of a view and a table" and
>>>> concludes that they don't have any advantage, for the same set of reasons.
>>>> But those reasons are not really convincing to me, so let's talk about them
>>>> in more detail.
>>>>
>>>> (1) You said "I don’t see a reason why a combined view and table is
>>>> advantageous" as "this would cause unnecessary dependence between the view
>>>> and table in catalogs."  What dependency exactly do you mean here? And why
>>>> is that unnecessary, given there has to be some sort of dependency anyway
>>>> unless we go with option 5 or 6?
>>>>
>>>> (2) You said "I guess there’s an argument that you could load both
>>>> table and view metadata locations at the same time. That hardly seems worth
>>>> the trouble". I disagree with that. Catalog interaction performance is
>>>> critical to at least everyone working in EMR and Athena, and MV itself as
>>>> an acceleration approach needs to be as fast as possible.
>>>>
>>>> I have put 3 key operations in the doc that I think matter for MV
>>>> during interactions with an engine:
>>>> 1. refresh the storage table
>>>> 2. get the storage table of the MV
>>>> 3. if stale, get the view SQL
>>>>
>>>> And option 1 clearly falls short, with 4 sequential steps required to
>>>> load a storage table. You mentioned "recent issues with adding views to the
>>>> JDBC catalog" on this topic; could you explain a bit more?
>>>>
>>>> (3) You said "I also think that once we decide on structure, we can
>>>> make it possible for REST catalog implementations to do smart things, in a
>>>> way that doesn’t put additional requirements on the underlying catalog
>>>> store." If REST is fully compatible with Iceberg spec then I have no
>>>> problem with this statement. However, as we discussed in the other thread,
>>>> it is not the case. In the current state, I think the sequence of actions
>>>> should be to evolve the Iceberg table/view spec (or add an MV spec) first,
>>>> and then think about how REST can incorporate it or do smart things that
>>>> are not Iceberg spec compliant. Do you agree with that?
>>>>
>>>> (4) You said the table identifier pointer "is a problem we need to
>>>> solve generally because a materialized table needs to be able to track the
>>>> upstream state of tables that were used". I don't think that is a reason to
>>>> choose to use a table identifier pointer for a storage table. The issue is
>>>> not about using a table identifier pointer. It is about exposing the
>>>> storage table as a separate entity in the catalog, which is what people do
>>>> not like and was already discussed at length in Jan's question 3 (also
>>>> linked in the sheet). I agree with that statement, because without a REST
>>>> implementation that can magically hide the storage table, this model adds
>>>> an additional burden regarding compliance and data governance for any other
>>>> non-REST catalog implementations that are compliant with the Iceberg spec.
>>>> Many mechanisms need to be built into a catalog to hide, protect, maintain,
>>>> and recycle the storage table, all of which can be avoided with other approaches. I
>>>> think we should reach a consensus about that and discuss further if you do
>>>> not agree.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <jank...@mailbox.org.invalid>
>>>> wrote:
>>>>
>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>> where your categories correspond to the following designs:
>>>>>
>>>>>    - Separate table and view => Design 1
>>>>>    - Combination of view and table => Design 2
>>>>>    - A new metadata type => Design 4
>>>>>
>>>>> Jan
>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>
>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll
>>>>> be more specific:
>>>>>
>>>>>    - *Separate table and view*: this option is to have the objects
>>>>>    that we have today, with extra metadata. Commit processes are separate:
>>>>>    committing to the table doesn’t alter the view and committing to the
>>>>>    view doesn’t change the table. However, changing the view can make it
>>>>>    so the table is no longer useful as a materialization.
>>>>>    - *A combination of a view and a table*: in this option, the table
>>>>>    metadata and view metadata are the same as in the first option. The
>>>>>    difference is that the commit process combines them, either by
>>>>>    embedding a table metadata location in view metadata or by tracking
>>>>>    both in the same catalog reference.
>>>>>    - *A new metadata type*: this option is where we define a new
>>>>>    metadata object that has view attributes, like SQL representations,
>>>>>    along with table attributes, like partition specs and snapshots.
>>>>>
>>>>> Hopefully this is clear because I think much of the confusion is
>>>>> caused by different definitions.
>>>>>
>>>>> The LoadTableResponse having optional metadata-location field implies
>>>>> that the object in the catalog no longer needs to hold a metadata file
>>>>> pointer
>>>>>
>>>>> The REST protocol has not removed the requirement for a metadata file,
>>>>> so I’m going to keep focused on the MV design options.
>>>>>
>>>>> When we say a MV can be a “new metadata type”, it does not mean it
>>>>> needs to define a completely brand new structure of the metadata content
>>>>>
>>>>> I’m making a distinction between separate metadata files for the table
>>>>> and the view and a combined metadata object, as above.
>>>>>
>>>>> We can define an “Iceberg MV” to be an object in a catalog, which has
>>>>> 1 table metadata file pointer, and 1 view metadata file pointer
>>>>>
>>>>> This is the option I am referring to as a “combination of a view and a
>>>>> table”.
>>>>>
>>>>> So to review my initial email, I don’t see a reason why a combined
>>>>> view and table is advantageous, either implemented by having a catalog
>>>>> reference with two metadata locations or embedding a table metadata
>>>>> location in view metadata. This would cause unnecessary dependence between
>>>>> the view and table in catalogs. I guess there’s an argument that you could
>>>>> load both table and view metadata locations at the same time. That hardly
>>>>> seems worth the trouble given the recent issues with adding views to the
>>>>> JDBC catalog.
>>>>>
>>>>> I also think that once we decide on structure, we can make it possible
>>>>> for REST catalog implementations to do smart things, in a way that doesn’t
>>>>> put additional requirements on the underlying catalog store. For instance,
>>>>> we could specify how to send additional objects in a LoadViewResult, in
>>>>> case the catalog wants to pre-fetch table metadata. I think these
>>>>> optimizations are a later addition, after we define the relationship
>>>>> between views and tables.
>>>>>
>>>>> Jack, it sounds like you’re the proponent of a combined table and view
>>>>> (rather than a new metadata spec for a materialized view). What is the 
>>>>> main
>>>>> motivation? It seems like you’re convinced of that approach, but I don’t
>>>>> understand the advantage it brings.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Yes I mostly agree with the assessment.  To clarify a few minor
>>>>>> points.
>>>>>>
>>>>>> is a materialized view a view and a separate table, a combination of
>>>>>>> the two (i.e. commits are combined), or a new metadata type?
>>>>>>
>>>>>>
>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal of
>>>>>> a new Catalog MV object that has two references (ViewMetadata +
>>>>>> TableMetadata).
>>>>>>
>>>>>> The arguments that I see for a combined materialized view object are:
>>>>>>>
>>>>>>>    - Regular views are separate, rather than being tables with SQL
>>>>>>>    and no data so it would be inconsistent (“Iceberg view is just a
>>>>>>>    table with no data but with representations defined. But we did not
>>>>>>>    do that.”)
>>>>>>>
>>>>>>>
>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>
>>>>>>>
>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>    materialized views
>>>>>>>
>>>>>>>
>>>>>>>    - Tables are not typically exposed to end users — but this isn’t
>>>>>>>    required by the separate view and table option
>>>>>>>
>>>>>>> For completeness, there seem to be a few additional ones (mentioned
>>>>>> in the Slack and above messages).
>>>>>>
>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says, it is
>>>>>>    a spec change (i.e., to catalogs)
>>>>>>    - A single call to get the View's StorageTable (versus two calls)
>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>
>>>>>>
>>>>>> *Thoughts:* I think the long discussion sessions we had on Slack
>>>>>> were fruitful for me, as seeing the API clarified some things.
>>>>>>
>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>> (TableMetadata + ViewMetadata).  But seeing that most of the MV
>>>>>> operations end up being ViewCatalog or Catalog operations, I am starting
>>>>>> to think, API-wise, that it may not align with the new metadata type
>>>>>> (unless we define an MVCatalog and /MV REST endpoints, which would then
>>>>>> be boilerplate wrappers).
>>>>>>
>>>>>> Initially one question I had for option 'a view and a separate
>>>>>> table', was how to make this table reference (metadata.json or catalog
>>>>>> reference).  In the previous option, we had a precedent of Catalog
>>>>>> references to Metadata, but not pointers between Metadatas.  I initially
>>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting' catalog
>>>>>> concerns in ViewMetadata.  (I saw Catalog and ViewCatalog as a layer 
>>>>>> above
>>>>>> TableMetadata and ViewMetadata).  But I think Dan in the Slack made a 
>>>>>> fair
>>>>>> point that ViewMetadata already is tightly bound with a Catalog.  In this
>>>>>> case, I think this approach does have its merits as well in aligning
>>>>>> Catalog API's with the metadata.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I would like to provide my perspective on the question of what a
>>>>>>> materialized view is and elaborate on Jack's recent proposal to view a
>>>>>>> materialized view as a catalog concept.
>>>>>>>
>>>>>>> Firstly, let's look at the role of the catalog. Every entity in the
>>>>>>> catalog has a *unique identifier*, and the catalog provides methods
>>>>>>> to create, load, and update these entities. An important thing to note 
>>>>>>> is
>>>>>>> that the catalog methods exhibit two different behaviors: the *create
>>>>>>> and load methods deal with the entire entity*, while the *update(commit)
>>>>>>> method only deals with partial changes* to the entities.
>>>>>>>
>>>>>>> In the context of our current discussion, materialized view (MV)
>>>>>>> metadata is a union of view and table metadata. The fact that the update
>>>>>>> method deals only with partial changes enables us to *reuse the
>>>>>>> existing methods for updating tables and views*. For updates, we
>>>>>>> don't have to define what constitutes an entire materialized view. 
>>>>>>> Changes
>>>>>>> to a materialized view targeting the properties related to the view
>>>>>>> metadata could use the update(commit) view method. Similarly, changes
>>>>>>> targeting the properties related to the table metadata could use the
>>>>>>> update(commit) table method. This is great news because we don't have to
>>>>>>> redefine view and table commits (requirements, updates).
>>>>>>> This is shown by the fact that Jack uses the same operation to
>>>>>>> update the storage table for Options 1 and 3:
>>>>>>>
>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>
>>>>>>> The open question is *whether the create and load methods should
>>>>>>> treat the properties that constitute the MV metadata as two entities 
>>>>>>> (View
>>>>>>> + Table) or one entity (new MV object)*. This is all part of Jack's
>>>>>>> proposal, where Option 1 proposes a new MV object, and Option 3 proposes
>>>>>>> two separate entities. The advantage of Option 1 is that it doesn't 
>>>>>>> require
>>>>>>> two operations to load the metadata. On the other hand, the advantage of
>>>>>>> Option 3 is that no new operations or catalogs have to be defined.
>>>>>>>
>>>>>>> In my opinion, defining a new representation for materialized views
>>>>>>> (Option 1) is generally the cleaner solution. However, I see a path 
>>>>>>> where
>>>>>>> we could first introduce Option 3 and still have the possibility to
>>>>>>> transition to Option 1 if needed. The great thing about Option 3 is 
>>>>>>> that it
>>>>>>> only requires minor changes to the current spec and is mostly an
>>>>>>> implementation detail.
>>>>>>>
>>>>>>> Therefore I would propose small additions to Jack's Option 3 that
>>>>>>> only introduce changes to the spec that are not specific to materialized
>>>>>>> views. The idea is to introduce boolean properties to be set on the
>>>>>>> creation of the view and the storage table that indicate that they 
>>>>>>> belong
>>>>>>> to a materialized view. The view property "materialized" is set to 
>>>>>>> "true"
>>>>>>> for a MV and "false" for a regular view. And the table property
>>>>>>> "storage_table" is set to "true" for a storage table and "false" for a
>>>>>>> regular table. The absence of these properties indicates a regular view 
>>>>>>> or
>>>>>>> table.
>>>>>>>
>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>
>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>
>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>
>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>
>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>> called "AssertProperty", which could make sure to only perform updates
>>>>>>> that are in line with materialized views. The additional requirement
>>>>>>> can be seen as a general extension which does not need to be changed
>>>>>>> if we decide to go with Option 1 in the future.
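As a rough sketch of what such a requirement could look like, mirroring the shape of existing commit requirements in the REST catalog spec (e.g. assert-ref-snapshot-id); the requirement type and field names below are hypothetical, not part of any spec:

```json
{
  "type": "assert-property",
  "property": "materialized",
  "value": "true"
}
```

A commit carrying this requirement would be rejected unless the target view's "materialized" property currently has the asserted value.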
>>>>>>>
>>>>>>> Let me know what you think.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>>
>>>>>>> Jan
>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>
>>>>>>> Thanks Ryan for the insights. I agree that reusing existing metadata
>>>>>>> definitions and minimizing spec changes are very important. This also
>>>>>>> minimizes spec drift (between materialized views and views spec, and
>>>>>>> between materialized views and tables spec), and simplifies the
>>>>>>> implementation.
>>>>>>>
>>>>>>> In an effort to take the discussion forward with concrete design
>>>>>>> options based on an end-to-end implementation, I have prototyped the
>>>>>>> implementation (and added Spark support) in this PR
>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us
>>>>>>> reach convergence faster. More details about some of the design options 
>>>>>>> are
>>>>>>> discussed in the description of the PR.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I mean separate table and view metadata that is somehow combined
>>>>>>>> through a commit process. For instance, keeping a pointer to a table
>>>>>>>> metadata file in a view metadata file or combining commits to reference
>>>>>>>> both. I don't see the value in either option.
>>>>>>>>
>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ryan for the help to trace back to the root question! Just
>>>>>>>>> a clarification question regarding your reply before I reply further: 
>>>>>>>>> what
>>>>>>>>> exactly does the option "a combination of the two (i.e. commits are
>>>>>>>>> combined)" mean? How is that different from "a new metadata type"?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring a
>>>>>>>>>> fresh perspective.
>>>>>>>>>>
>>>>>>>>>> Jack already pointed out that we need to start from the basics
>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right now 
>>>>>>>>>> is the
>>>>>>>>>> time for discussing trade-offs, not lining up and taking sides. I 
>>>>>>>>>> realize
>>>>>>>>>> that wasn’t the intent with adding a vote, but that’s almost always 
>>>>>>>>>> the
>>>>>>>>>> result. It’s too easy to use it as a stand-in for consensus and move 
>>>>>>>>>> on
>>>>>>>>>> prematurely. I get the impression from the swirl in Slack that 
>>>>>>>>>> discussion
>>>>>>>>>> has moved ahead of agreement.
>>>>>>>>>>
>>>>>>>>>> We’re still at the most basic question: is a materialized view a
>>>>>>>>>> view and a separate table, a combination of the two (i.e. commits are
>>>>>>>>>> combined), or a new metadata type?
>>>>>>>>>>
>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind
>>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the 
>>>>>>>>>> catalog.
>>>>>>>>>> That’s a later choice (already pointed out) and, I suspect, it 
>>>>>>>>>> should be
>>>>>>>>>> delegated to catalog implementations.
>>>>>>>>>>
>>>>>>>>>> To simplify this a little, I think that we can eliminate the
>>>>>>>>>> option to combine table and view commits. I don’t think there is a
>>>>>>>>>> reason to combine the two. If separate, a table would track the view
>>>>>>>>>> version used along with freshness information for referenced tables.
>>>>>>>>>> If the table is automatically skipped when the version no longer
>>>>>>>>>> matches the view, then no action needs to happen when a view
>>>>>>>>>> definition changes. Similarly, the table can be updated independently
>>>>>>>>>> without needing to also swap view metadata. This also aligns with the
>>>>>>>>>> idea from the original doc that there can be multiple materialization
>>>>>>>>>> tables for a view. Each should operate independently unless I’m
>>>>>>>>>> missing something.
>>>>>>>>>>
>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious so
>>>>>>>>>> I’ll move on, but please stop here and reply if you disagree!
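The skip-on-mismatch behavior Ryan describes (a materialization table records the view version it was built from, and is silently skipped when the view has changed) reduces to a trivial comparison. In this sketch, where the recorded version comes from is left open and the names are illustrative only, not part of any spec:

```java
public class MaterializationVersionCheck {
  // If the view definition has changed since the table was materialized,
  // the table is skipped and the engine falls back to the view SQL.
  // No commit to the view or table is needed when the other one changes.
  public static boolean isUsable(int versionRecordedInTable, int currentViewVersion) {
    return versionRecordedInTable == currentViewVersion;
  }

  public static void main(String[] args) {
    System.out.println(isUsable(3, 3)); // view unchanged: use the materialization
    System.out.println(isUsable(3, 4)); // view was redefined: fall back to SQL
  }
}
```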
>>>>>>>>>>
>>>>>>>>>> That leaves the main two options, a view and a separate table
>>>>>>>>>> linked by metadata, or, combined materialized view metadata.
>>>>>>>>>>
>>>>>>>>>> As the doc notes, the separate view and table option is simpler
>>>>>>>>>> because it reuses existing metadata definitions and falls back to 
>>>>>>>>>> simple
>>>>>>>>>> views. That is a significantly smaller spec and small is very, very
>>>>>>>>>> important when it comes to specs. I think that the argument for a new
>>>>>>>>>> definition of a materialized view needs to overcome this 
>>>>>>>>>> disadvantage.
>>>>>>>>>>
>>>>>>>>>> The arguments that I see for a combined materialized view object
>>>>>>>>>> are:
>>>>>>>>>>
>>>>>>>>>>    - Regular views are separate, rather than being tables with
>>>>>>>>>>    SQL and no data so it would be inconsistent (“Iceberg view is
>>>>>>>>>>    just a table with no data but with representations defined. But
>>>>>>>>>>    we did not do that.”)
>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>    materialized views
>>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>
>>>>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>>

-- 
Ryan Blue
Tabular
