Re: Materialized view integration with REST spec

Jack Ye Fri, 01 Mar 2024 15:43:41 -0800

Sounds good, let's discuss this in person!

I am a bit worried that we have quite a few critical topics going on right
now on the dev list, and this one will take a lot of time to discuss. If it
ends up running too long, I propose we hold a dedicated meeting, and I am
more than happy to organize it.


Best,
Jack Ye

On Fri, Mar 1, 2024 at 12:48 PM Ryan Blue <b...@tabular.io> wrote:

> Hey everyone,
>
> I think this thread has hit a point of diminishing returns and that we
> still don't have a common understanding of what the options under
> consideration actually are.
>
> Since we were already planning on discussing this at the next community
> sync, I suggest we pick this up there and use that time to align on what
> exactly we're considering. We can then start a new thread to lay out the
> designs under consideration in more detail and then have a discussion about
> trade-offs.
>
> Does that sound reasonable?
>
> Ryan
>
>
> On Fri, Mar 1, 2024 at 11:09 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> I am finding it hard to interpret the options concretely. I would also
>> suggest breaking the expectation/outcome to milestones. Maybe it becomes
>> easier if we agree to distinguish between an approach that is feasible in
>> the near term and another in the long term, especially if the latter
>> requires significant engine-side changes.
>>
>> Further, maybe it helps if we start with an option that fully reuses the
>> existing spec, and see how we view it in comparison with the options
>> discussed previously. I am sharing one below. It reuses the current spec of
>> Iceberg views and tables by leveraging table properties to capture
>> materialized view metadata. What is common (and not common) between this
>> and the desired representations?
>>
>> The new properties are:
>>
>> Properties on a View:
>>
>>    1. *iceberg.materialized.view*:
>>       - *Type*: View property
>>       - *Purpose*: This property is used to mark whether a view is a
>>       materialized view. If set to true, the view is treated as a
>>       materialized view. This helps in differentiating between virtual and
>>       materialized views within the catalog and dictates specific handling
>>       and validation logic for materialized views.
>>    2. *iceberg.materialized.view.storage.location*:
>>       - *Type*: View property
>>       - *Purpose*: Specifies the location of the storage table
>>       associated with the materialized view. This property is used for
>>       linking a materialized view with its corresponding storage table,
>>       enabling data management and query execution based on the stored
>>       data freshness.
>>
>> Properties on a Table:
>>
>>    1. *base.snapshot.[UUID]*:
>>       - *Type*: Table property
>>       - *Purpose*: These properties store the snapshot IDs of the base
>>       tables at the time the materialized view's data was last updated. Each
>>       property is prefixed with base.snapshot. followed by the UUID of
>>       the base table. They are used to track whether the materialized view's
>>       data is up to date with the base tables by comparing these snapshot
>>       IDs with the current snapshot IDs of the base tables. If all the base
>>       tables' current snapshot IDs match the ones stored in these
>>       properties, the materialized view's data is considered fresh.
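
The freshness rule described for base.snapshot.[UUID] can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of the Iceberg API, plain maps stand in for table properties and catalog lookups, and treating a base table with no recorded snapshot as stale is a choice made here, not something the proposal specifies.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Iceberg API. */
final class MaterializedViewFreshness {
    // Property prefix taken from the proposal in this thread.
    static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

    private MaterializedViewFreshness() {}

    /**
     * Returns true only if every base table's current snapshot ID matches the
     * snapshot ID recorded in the storage table's properties.
     *
     * @param storageTableProperties properties of the storage table
     * @param currentSnapshotIds current snapshot ID of each base table, keyed by table UUID
     */
    static boolean isFresh(Map<String, String> storageTableProperties,
                           Map<String, Long> currentSnapshotIds) {
        for (Map.Entry<String, Long> base : currentSnapshotIds.entrySet()) {
            String recorded = storageTableProperties.get(BASE_SNAPSHOT_PREFIX + base.getKey());
            // Assumption: a base table with no recorded snapshot counts as stale.
            if (recorded == null || Long.parseLong(recorded) != base.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

An engine would first have to load each base table to learn its current snapshot ID, so the cost of this check grows with the number of base tables.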
>>
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Fri, Mar 1, 2024 at 9:15 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> > All of these approaches are aligned in one, specific way: the storage
>>> table is an iceberg table.
>>>
>>> I do not think that is true. I think people are aligned that we would
>>> like to re-use the Iceberg table metadata defined in the Iceberg table spec
>>> to express the data in an MV, but I don't think it goes so far as to say it
>>> must be an Iceberg table. Once you have that mindset, then of course option
>>> 1 (separate table and view) is the only option.
>>>
>>> > I don't think that is necessary and it significantly increases the
>>> complexity.
>>>
>>> And can you quantify what you mean by "significantly increases the
>>> complexity"? Seems like a lot of concerns are coming from the tradeoff with
>>> complexity. We probably all agree that using option 7 (a completely new
>>> metadata type) is a lot of work from scratch, that is why it is not
>>> favored. However, my understanding is that as long as we re-use the view
>>> and table metadata, then the majority of the existing logic can be reused.
>>> I think what we have gone through in Slack to draft the rough Java API
>>> shape helps here, because people can estimate the amount of effort required
>>> to implement it. And I don't think they are **significantly** more complex
>>> to implement. Could you elaborate more about the complexity that you
>>> imagine?
>>>
>>> -Jack
>>>
>>>
>>>
>>> On Fri, Mar 1, 2024 at 8:57 AM Daniel Weeks <daniel.c.we...@gmail.com>
>>> wrote:
>>>
>>>> I feel I've been most vocal about pushing back against options 2+ (or
>>>> Ryan's categories of combined table/view, or new metadata type), so I'll
>>>> try to expand on my reasoning.
>>>>
>>>> I understand the appeal of creating a design where we encapsulate the
>>>> view/storage from both a structural and performance standpoint, but I don't
>>>> think that is necessary and it significantly increases the complexity.
>>>>
>>>> All of these approaches are aligned in one, specific way: the storage
>>>> table is an iceberg table.
>>>>
>>>> Because of this, all the behaviors and requirements still apply to
>>>> these tables.  They need to be maintained (snapshot cleanup, orphan files),
>>>> in some cases need to be optimized (compaction, manifest rewrites), and they
>>>> need to be inspectable (this will be even more important with MVs, since
>>>> staleness can produce different results and questions will arise about what
>>>> state the storage table was in).  There may be cases where the tables need
>>>> to be managed directly.
>>>>
>>>> Anywhere we deviate from the existing constructs/commit/access for
>>>> tables, we will ultimately have to then unwrap to re-expose the underlying
>>>> Iceberg behavior.  This creates unnecessary complexity in the library/API
>>>> layer, which is not the primary interface users will have with
>>>> materialized views, since an engine is almost always necessary to interact
>>>> with the dataset.
>>>>
>>>> As to the performance concerns around option 1, I think we're
>>>> overstating the downsides.  It really comes down to how many metadata loads
>>>> are necessary and evaluating freshness would likely be the real bottleneck
>>>> as it involves potentially loading many tables.  All of the options are on
>>>> the same order of performance for the metadata and table loads.
>>>>
>>>> As to the visibility of tables and whether they're registered in the
>>>> catalog, I think registering in the catalog is the right approach so that
>>>> the tables are still addressable for maintenance/etc.  The visibility of
>>>> the storage table is a catalog implementation decision and shouldn't be a
>>>> requirement of the MV spec (I can see cases for both and it isn't necessary
>>>> to dictate a behavior).
>>>>
>>>> I'm still strongly in favor of Option 1 (separate table and view) for
>>>> these reasons.
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Thu, Feb 29, 2024 at 11:07 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> > Jack, it sounds like you’re the proponent of a combined table and
>>>>> view (rather than a new metadata spec for a materialized view). What is
>>>>> the main motivation? It seems like you’re convinced of that approach, but I
>>>>> don’t understand the advantage it brings.
>>>>>
>>>>> Sorry, I had to make a Google Sheet to capture all the options we have
>>>>> discussed so far. I wanted to use the existing Google Doc, but it has
>>>>> really bad table/sheet support...
>>>>>
>>>>>
>>>>> https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0
>>>>>
>>>>> I have listed all the options, with how they are implemented and some
>>>>> important considerations we have discussed so far. Note that:
>>>>> 1. This sheet currently excludes the lineage information, which we can
>>>>> discuss more later after the current topic is resolved.
>>>>> 2. I removed the considerations for REST integration since from the
>>>>> other thread we have clarified that they should be considered completely
>>>>> separately.
>>>>>
>>>>> *Why I am a proponent of having a new MV object with table and
>>>>> view metadata file pointer*
>>>>>
>>>>> In my sheet, there are 3 options that do not have major problems:
>>>>> Option 2: Add storage table metadata file pointer in view object
>>>>> Option 5: New MV object with table and view metadata file pointer
>>>>> Option 6: New MV spec with table and view metadata
>>>>>
>>>>> I originally excluded option 2 because I thought it did not align with
>>>>> the REST spec, but after the other discussion thread about "Inconsistency
>>>>> between REST spec and table/view spec", I think my original concern no
>>>>> longer holds, so I have put it back. Based on my personal
>>>>> preference that an MV is an independent object that should be separated
>>>>> from view and table, plus the fact that option 5 is probably less work
>>>>> than option 6 to implement, I am a proponent of option 5
>>>>> at this moment.
>>>>>
>>>>>
>>>>> *Regarding Ryan's evaluation framework*
>>>>>
>>>>> I think we need to reconcile this sheet with Ryan's evaluation
>>>>> framework. That framework categorization puts option 2, 3, 4, 5, 6 all
>>>>> under the same category of "A combination of a view and a table" and
>>>>> concludes that they don't have any advantage for the same set of reasons.
>>>>> But those reasons are not really convincing to me so let's talk about them
>>>>> in more detail.
>>>>>
>>>>> (1) You said "I don’t see a reason why a combined view and table is
>>>>> advantageous" as "this would cause unnecessary dependence between the view
>>>>> and table in catalogs."  What dependency exactly do you mean here? And why
>>>>> is that unnecessary, given there has to be some sort of dependency anyway
>>>>> unless we go with option 5 or 6?
>>>>>
>>>>> (2) You said "I guess there’s an argument that you could load both
>>>>> table and view metadata locations at the same time. That hardly seems
>>>>> worth the trouble". I disagree with that. Catalog interaction performance
>>>>> is critical to at least everyone working in EMR and Athena, and MV itself
>>>>> as an acceleration approach needs to be as fast as possible.
>>>>>
>>>>> I have put 3 key operations in the doc that I think matter for MV
>>>>> during interactions with an engine:
>>>>> 1. refresh the storage table
>>>>> 2. get the storage table of the MV
>>>>> 3. if stale, get the view SQL
>>>>>
>>>>> And option 1 clearly falls short with 4 sequential steps required to
>>>>> load a storage table. You mentioned "recent issues with adding views to
>>>>> the JDBC catalog" in this topic, could you explain a bit more?
>>>>>
>>>>> (3) You said "I also think that once we decide on structure, we can
>>>>> make it possible for REST catalog implementations to do smart things, in a
>>>>> way that doesn’t put additional requirements on the underlying catalog
>>>>> store." If REST is fully compatible with Iceberg spec then I have no
>>>>> problem with this statement. However, as we discussed in the other thread,
>>>>> it is not the case. In the current state, I think the sequence of action
>>>>> should be to evolve the Iceberg table/view spec (or add a MV spec) first,
>>>>> and then think about how REST can incorporate it or do smart things that
>>>>> are not Iceberg spec compliant. Do you agree with that?
>>>>>
>>>>> (4) You said the table identifier pointer "is a problem we need to
>>>>> solve generally because a materialized table needs to be able to track the
>>>>> upstream state of tables that were used". I don't think that is a reason
>>>>> to choose to use a table identifier pointer for a storage table. The issue
>>>>> is not about using a table identifier pointer. It is about exposing the
>>>>> storage table as a separate entity in the catalog, which is what people do
>>>>> not like and is already discussed at length in Jan's question 3 (also
>>>>> linked in the sheet). I agree with that statement, because without a REST
>>>>> implementation that can magically hide the storage table, this model adds
>>>>> additional burden regarding compliance and data governance for any other
>>>>> non-REST catalog implementations that are compliant with the Iceberg spec.
>>>>> Many mechanisms need to be built in a catalog to hide, protect, maintain,
>>>>> and recycle the storage table, which can be avoided by using other
>>>>> approaches. I think we should reach a consensus about that and discuss
>>>>> further if you do not agree.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Thu, Feb 29, 2024 at 10:53 PM Jan Kaul <jank...@mailbox.org.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi Ryan, we actually discussed your categories in this question
>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi>,
>>>>>> where your categories correspond to the following designs:
>>>>>>
>>>>>>    - Separate table and view => Design 1
>>>>>>    - Combination of view and table => Design 2
>>>>>>    - A new metadata type => Design 4
>>>>>>
>>>>>> Jan
>>>>>> On 01.03.24 00:03, Ryan Blue wrote:
>>>>>>
>>>>>> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll
>>>>>> be more specific:
>>>>>>
>>>>>>    - *Separate table and view*: this option is to have the objects
>>>>>>    that we have today, with extra metadata. Commit processes are
>>>>>>    separate: committing to the table doesn’t alter the view and
>>>>>>    committing to the view doesn’t change the table. However, changing
>>>>>>    the view can make it so the table is no longer useful as a
>>>>>>    materialization.
>>>>>>    - *A combination of a view and a table*: in this option, the
>>>>>>    table metadata and view metadata are the same as the first option. The
>>>>>>    difference is that the commit process combines them, either by
>>>>>>    embedding a table metadata location in view metadata or by tracking
>>>>>>    both in the same catalog reference.
>>>>>>    - *A new metadata type*: this option is where we define a new
>>>>>>    metadata object that has view attributes, like SQL representations,
>>>>>>    along with table attributes, like partition specs and snapshots.
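
One way to picture the three categories is as data shapes. The Java records below are a rough sketch for illustration only: every type and field name is hypothetical, none of them exist in Iceberg, and the "combination" shape shows only the variant that tracks both locations in one catalog reference (the other variant embeds the table metadata location inside the view metadata).

```java
import java.util.List;

/** Hypothetical shapes for illustration only; none of these types exist in Iceberg. */
final class MaterializedViewShapes {
    private MaterializedViewShapes() {}

    // 1. Separate table and view: two independent catalog entries, each with
    //    its own metadata location and its own commit path.
    record ViewEntry(String identifier, String viewMetadataLocation) {}
    record TableEntry(String identifier, String tableMetadataLocation) {}

    // 2. Combination of a view and a table: the metadata files are unchanged,
    //    but one catalog reference tracks both locations, so commits are combined.
    record CombinedEntry(String identifier,
                         String viewMetadataLocation,
                         String tableMetadataLocation) {}

    // 3. New metadata type: a single metadata object carrying view attributes
    //    (SQL representations) next to table attributes (partition specs, snapshots).
    record MaterializedViewMetadata(List<String> sqlRepresentations,
                                    List<String> partitionSpecs,
                                    List<Long> snapshotIds) {}
}
```

Seen this way, the first two shapes reuse today's metadata files and differ only in what the catalog commits atomically, while the third requires defining new metadata content.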
>>>>>>
>>>>>> Hopefully this is clear because I think much of the confusion is
>>>>>> caused by different definitions.
>>>>>>
>>>>>> The LoadTableResponse having optional metadata-location field implies
>>>>>> that the object in the catalog no longer needs to hold a metadata file
>>>>>> pointer
>>>>>>
>>>>>> The REST protocol has not removed the requirement for a metadata
>>>>>> file, so I’m going to keep focused on the MV design options.
>>>>>>
>>>>>> When we say a MV can be a “new metadata type”, it does not mean it
>>>>>> needs to define a completely brand new structure of the metadata content
>>>>>>
>>>>>> I’m making a distinction between separate metadata files for the
>>>>>> table and the view and a combined metadata object, as above.
>>>>>>
>>>>>> We can define an “Iceberg MV” to be an object in a catalog, which has
>>>>>> 1 table metadata file pointer, and 1 view metadata file pointer
>>>>>>
>>>>>> This is the option I am referring to as a “combination of a view and
>>>>>> a table”.
>>>>>>
>>>>>> So to review my initial email, I don’t see a reason why a combined
>>>>>> view and table is advantageous, either implemented by having a catalog
>>>>>> reference with two metadata locations or embedding a table metadata
>>>>>> location in view metadata. This would cause unnecessary dependence
>>>>>> between the view and table in catalogs. I guess there’s an argument that
>>>>>> you could load both table and view metadata locations at the same time.
>>>>>> That hardly seems worth the trouble given the recent issues with adding
>>>>>> views to the JDBC catalog.
>>>>>>
>>>>>> I also think that once we decide on structure, we can make it
>>>>>> possible for REST catalog implementations to do smart things, in a way
>>>>>> that doesn’t put additional requirements on the underlying catalog store.
>>>>>> For instance, we could specify how to send additional objects in a
>>>>>> LoadViewResult, in case the catalog wants to pre-fetch table metadata. I
>>>>>> think these optimizations are a later addition, after we define the
>>>>>> relationship between views and tables.
>>>>>>
>>>>>> Jack, it sounds like you’re the proponent of a combined table and
>>>>>> view (rather than a new metadata spec for a materialized view). What is
>>>>>> the main motivation? It seems like you’re convinced of that approach, but
>>>>>> I don’t understand the advantage it brings.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Yes I mostly agree with the assessment.  To clarify a few minor
>>>>>>> points.
>>>>>>>
>>>>>>>> is a materialized view a view and a separate table, a combination of
>>>>>>>> the two (i.e. commits are combined), or a new metadata type?
>>>>>>>
>>>>>>>
>>>>>>> For 'new metadata type', I consider mostly Jack's initial proposal
>>>>>>> of a new Catalog MV object that has two references (ViewMetadata +
>>>>>>> TableMetadata).
>>>>>>>
>>>>>>>> The arguments that I see for a combined materialized view object
>>>>>>>> are:
>>>>>>>>
>>>>>>>>    - Regular views are separate, rather than being tables with SQL
>>>>>>>>    and no data so it would be inconsistent (“Iceberg view is just a
>>>>>>>>    table with no data but with representations defined. But we did
>>>>>>>>    not do that.”)
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>    materialized views
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>
>>>>>>> For completeness, there seem to be a few additional ones (mentioned
>>>>>>> in the Slack and above messages).
>>>>>>>
>>>>>>>    - Lack of spec change (to ViewMetadata).  But as Jack says it is
>>>>>>>    a spec change (ie, to catalogs)
>>>>>>>    - A single call to get the View's StorageTable (versus two calls)
>>>>>>>    - A more natural API, no opportunity for user to call
>>>>>>>    Catalog.dropTable() and renameTable() on storage table
>>>>>>>
>>>>>>>
>>>>>>> *Thoughts:  *I think the long discussion sessions we had on Slack
>>>>>>> were fruitful for me, as seeing the API clarified some things.
>>>>>>>
>>>>>>> I was initially more in favor of MV being a new metadata type
>>>>>>> (TableMetadata + ViewMetadata).  But seeing most of the MV operations
>>>>>>> end up being ViewCatalog or Catalog operations, I am starting to think
>>>>>>> API-wise that it may not align with the new metadata type (unless we
>>>>>>> define MVCatalog and /MV REST endpoints, which then are boilerplate
>>>>>>> wrappers).
>>>>>>>
>>>>>>> Initially one question I had for option 'a view and a separate
>>>>>>> table', was how to make this table reference (metadata.json or catalog
>>>>>>> reference).  In the previous option, we had a precedent of Catalog
>>>>>>> references to Metadata, but not pointers between Metadatas.  I initially
>>>>>>> saw the proposed Catalog's TableIdentifier pointer as 'polluting'
>>>>>>> catalog concerns in ViewMetadata.  (I saw Catalog and ViewCatalog as a
>>>>>>> layer above TableMetadata and ViewMetadata).  But I think Dan in the
>>>>>>> Slack made a fair point that ViewMetadata already is tightly bound with
>>>>>>> a Catalog.  In this case, I think this approach does have its merits as
>>>>>>> well in aligning Catalog APIs with the metadata.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I would like to provide my perspective on the question of what a
>>>>>>>> materialized view is and elaborate on Jack's recent proposal to view a
>>>>>>>> materialized view as a catalog concept.
>>>>>>>>
>>>>>>>> Firstly, let's look at the role of the catalog. Every entity in the
>>>>>>>> catalog has a *unique identifier*, and the catalog provides
>>>>>>>> methods to create, load, and update these entities. An important thing
>>>>>>>> to note is that the catalog methods exhibit two different behaviors:
>>>>>>>> the *create and load methods deal with the entire entity*, while the
>>>>>>>> *update(commit) method only deals with partial changes* to the
>>>>>>>> entities.
>>>>>>>>
>>>>>>>> In the context of our current discussion, materialized view (MV)
>>>>>>>> metadata is a union of view and table metadata. The fact that the
>>>>>>>> update method deals only with partial changes enables us to *reuse the
>>>>>>>> existing methods for updating tables and views*. For updates we
>>>>>>>> don't have to define what constitutes an entire materialized view.
>>>>>>>> Changes to a materialized view targeting the properties related to the
>>>>>>>> view metadata could use the update(commit) view method. Similarly,
>>>>>>>> changes targeting the properties related to the table metadata could
>>>>>>>> use the update(commit) table method. This is great news because we
>>>>>>>> don't have to redefine view and table commits (requirements, updates).
>>>>>>>> This is shown in the fact that Jack uses the same operation to
>>>>>>>> update the storage table for Option 1 and 3:
>>>>>>>>
>>>>>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>>>>>> // non-REST: update JSON files at table_metadata_location
>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>
>>>>>>>> The open question is *whether the create and load methods should
>>>>>>>> treat the properties that constitute the MV metadata as two entities
>>>>>>>> (View + Table) or one entity (new MV object)*. This is all part of
>>>>>>>> Jack's proposal, where Option 1 proposes a new MV object, and Option 3
>>>>>>>> proposes two separate entities. The advantage of Option 1 is that it
>>>>>>>> doesn't require two operations to load the metadata. On the other
>>>>>>>> hand, the advantage of Option 3 is that no new operations or catalogs
>>>>>>>> have to be defined.
>>>>>>>>
>>>>>>>> In my opinion, defining a new representation for materialized views
>>>>>>>> (Option 1) is generally the cleaner solution. However, I see a path
>>>>>>>> where we could first introduce Option 3 and still have the possibility
>>>>>>>> to transition to Option 1 if needed. The great thing about Option 3 is
>>>>>>>> that it only requires minor changes to the current spec and is mostly
>>>>>>>> implementation detail.
>>>>>>>>
>>>>>>>> Therefore I would propose small additions to Jack's Option 3 that
>>>>>>>> only introduce changes to the spec that are not specific to
>>>>>>>> materialized views. The idea is to introduce boolean properties to be
>>>>>>>> set on the creation of the view and the storage table that indicate
>>>>>>>> that they belong to a materialized view. The view property
>>>>>>>> "materialized" is set to "true" for a MV and "false" for a regular
>>>>>>>> view. And the table property "storage_table" is set to "true" for a
>>>>>>>> storage table and "false" for a regular table. The absence of these
>>>>>>>> properties indicates a regular view or table.
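
A minimal Java sketch of how a reader might interpret these two markers, assuming plain string maps stand in for view and table properties. The property names come from the proposal; the class and method names are hypothetical.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Iceberg API. */
final class MaterializedViewMarkers {
    static final String MATERIALIZED = "materialized";   // view property from the proposal
    static final String STORAGE_TABLE = "storage_table"; // table property from the proposal

    private MaterializedViewMarkers() {}

    /** True only when the view property is present and equal to "true" (case-insensitive). */
    static boolean isMaterializedView(Map<String, String> viewProperties) {
        // Boolean.parseBoolean(null) is false, so an absent property means a regular view.
        return Boolean.parseBoolean(viewProperties.get(MATERIALIZED));
    }

    /** True only when the table property is present and equal to "true" (case-insensitive). */
    static boolean isStorageTable(Map<String, String> tableProperties) {
        return Boolean.parseBoolean(tableProperties.get(STORAGE_TABLE));
    }
}
```

Because an absent property and an explicit "false" are read the same way, existing views and tables need no migration under this scheme.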
>>>>>>>>
>>>>>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>>>>>
>>>>>>>> // REST: GET /namespaces/db1/views/mv1
>>>>>>>> // non-REST: load JSON file at metadata_location
>>>>>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>>>>>
>>>>>>>> // REST: GET /namespaces/db1/tables/mv1
>>>>>>>> // non-REST: load JSON file at table_metadata_location if present
>>>>>>>> Table storageTable = mv.storageTable();
>>>>>>>>
>>>>>>>> // REST: POST /namespaces/db1/tables/mv1
>>>>>>>> // non-REST: update JSON file at table_metadata_location
>>>>>>>> storageTable.newAppend().appendFile(...).commit();
>>>>>>>>
>>>>>>>> We could then introduce a new requirement for views and tables
>>>>>>>> called "AssertProperty" which could make sure to only perform updates
>>>>>>>> that are in line with materialized views. The additional requirement
>>>>>>>> can be seen as a general extension which does not need to be changed
>>>>>>>> if we decide to go with Option 1 in the future.
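
To make the "AssertProperty" idea concrete, here is a hedged Java sketch. Iceberg does not define such a requirement today, so the class below is hypothetical and uses a plain map in place of table or view metadata. It mirrors the general pattern of a commit requirement: validate against the current state and reject the commit on mismatch.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical requirement; Iceberg does not define "AssertProperty" today. */
final class AssertProperty {
    private final String key;
    private final String expected;

    AssertProperty(String key, String expected) {
        this.key = Objects.requireNonNull(key, "key");
        this.expected = expected; // null means "the property must be absent"
    }

    /** Throws if the current metadata properties do not satisfy the requirement. */
    void validate(Map<String, String> currentProperties) {
        String actual = currentProperties.get(key);
        if (!Objects.equals(expected, actual)) {
            throw new IllegalStateException(
                "Requirement failed: property " + key
                    + " expected " + expected + " but was " + actual);
        }
    }
}
```

For example, a refresh commit against a storage table could carry `new AssertProperty("storage_table", "true")`, so the commit fails if it is applied to a regular table.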
>>>>>>>>
>>>>>>>> Let me know what you think.
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>>
>>>>>>>> Jan
>>>>>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>>>>>
>>>>>>>> Thanks Ryan for the insights. I agree that reusing existing
>>>>>>>> metadata definitions and minimizing spec changes are very important.
>>>>>>>> This also minimizes spec drift (between materialized views and views
>>>>>>>> spec, and between materialized views and tables spec), and simplifies
>>>>>>>> the implementation.
>>>>>>>>
>>>>>>>> In an effort to take the discussion forward with concrete design
>>>>>>>> options based on an end-to-end implementation, I have prototyped the
>>>>>>>> implementation (and added Spark support) in this PR
>>>>>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us
>>>>>>>> reach convergence faster. More details about some of the design
>>>>>>>> options are discussed in the description of the PR.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> I mean separate table and view metadata that is somehow combined
>>>>>>>>> through a commit process. For instance, keeping a pointer to a table
>>>>>>>>> metadata file in a view metadata file or combining commits to
>>>>>>>>> reference both. I don't see the value in either option.
>>>>>>>>>
>>>>>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Ryan for the help to trace back to the root question! Just
>>>>>>>>>> a clarification question regarding your reply before I reply
>>>>>>>>>> further: what exactly does the option "a combination of the two
>>>>>>>>>> (i.e. commits are combined)" mean? How is that different from "a new
>>>>>>>>>> metadata type"?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I’m catching up on this conversation, so hopefully I can bring a
>>>>>>>>>>> fresh perspective.
>>>>>>>>>>>
>>>>>>>>>>> Jack already pointed out that we need to start from the basics
>>>>>>>>>>> and I agree with that. Let’s remove voting at this point. Right now
>>>>>>>>>>> is the time for discussing trade-offs, not lining up and taking
>>>>>>>>>>> sides. I realize that wasn’t the intent with adding a vote, but
>>>>>>>>>>> that’s almost always the result. It’s too easy to use it as a
>>>>>>>>>>> stand-in for consensus and move on prematurely. I get the impression
>>>>>>>>>>> from the swirl in Slack that discussion has moved ahead of agreement.
>>>>>>>>>>>
>>>>>>>>>>> We’re still at the most basic question: is a materialized view a
>>>>>>>>>>> view and a separate table, a combination of the two (i.e. commits
>>>>>>>>>>> are combined), or a new metadata type?
>>>>>>>>>>>
>>>>>>>>>>> For now, I’m ignoring whether the “separate table” is some kind
>>>>>>>>>>> of “system table” (meaning hidden?) or if it is exposed in the
>>>>>>>>>>> catalog. That’s a later choice (already pointed out) and, I suspect,
>>>>>>>>>>> it should be delegated to catalog implementations.
>>>>>>>>>>>
>>>>>>>>>>> To simplify this a little, I think that we can eliminate the
>>>>>>>>>>> option to combine table and view commits. I don’t think there is a
>>>>>>>>>>> reason to combine the two. If separate, a table would track the view
>>>>>>>>>>> version used along with freshness information for referenced tables.
>>>>>>>>>>> If the table is automatically skipped when the version no longer
>>>>>>>>>>> matches the view, then no action needs to happen when a view
>>>>>>>>>>> definition changes. Similarly, the table can be updated independently
>>>>>>>>>>> without needing to also swap view metadata. This also aligns with the
>>>>>>>>>>> idea from the original doc that there can be multiple materialization
>>>>>>>>>>> tables for a view. Each should operate
>>>>>>>>>>> independently unless I’m missing something
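
The "skip the table when the view version no longer matches" rule can be sketched as follows. This is only an illustration: the record, its fields, and the selector are hypothetical, and the freshness check against referenced tables is deliberately left out to keep the example small.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical selector for illustration; not part of the Iceberg API. */
final class MaterializationSelector {
    private MaterializationSelector() {}

    /** A materialization table plus the view version it was computed from (hypothetical). */
    record Materialization(String tableIdentifier, int viewVersionId) {}

    /**
     * Returns the first materialization built from the current view version.
     * An empty result means every table is skipped and the engine falls back
     * to running the view SQL.
     */
    static Optional<Materialization> select(int currentViewVersionId,
                                            List<Materialization> candidates) {
        return candidates.stream()
            .filter(m -> m.viewVersionId() == currentViewVersionId)
            .findFirst();
    }
}
```

Nothing has to be written when the view definition changes: tables recorded against an older version simply stop matching, which is why the view and table commits can stay independent.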
>>>>>>>>>>>
>>>>>>>>>>> I don’t think the last paragraph’s conclusion is contentious so
>>>>>>>>>>> I’ll move on, but please stop here and reply if you disagree!
>>>>>>>>>>>
>>>>>>>>>>> That leaves the main two options, a view and a separate table
>>>>>>>>>>> linked by metadata, or, combined materialized view metadata.
>>>>>>>>>>>
>>>>>>>>>>> As the doc notes, the separate view and table option is simpler
>>>>>>>>>>> because it reuses existing metadata definitions and falls back to
>>>>>>>>>>> simple views. That is a significantly smaller spec and small is very,
>>>>>>>>>>> very important when it comes to specs. I think that the argument for
>>>>>>>>>>> a new definition of a materialized view needs to overcome this
>>>>>>>>>>> disadvantage.
>>>>>>>>>>>
>>>>>>>>>>> The arguments that I see for a combined materialized view object
>>>>>>>>>>> are:
>>>>>>>>>>>
>>>>>>>>>>>    - Regular views are separate, rather than being tables with
>>>>>>>>>>>    SQL and no data so it would be inconsistent (“Iceberg view is
>>>>>>>>>>>    just a table with no data but with representations defined. But
>>>>>>>>>>>    we did not do that.”)
>>>>>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>>>>>    materialized views
>>>>>>>>>>>    - Tables are not typically exposed to end users — but this
>>>>>>>>>>>    isn’t required by the separate view and table option
>>>>>>>>>>>
>>>>>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>>
>
> --
> Ryan Blue
> Tabular
>
