Thanks, Ryan, for the insights. I agree that reusing existing metadata
definitions and minimizing spec changes are very important. This also
minimizes spec drift (between materialized views and the view spec, and
between materialized views and the table spec) and simplifies the
implementation.

In an effort to take the discussion forward with concrete design options
based on an end-to-end implementation, I have prototyped the
implementation (and added Spark support) in this PR
https://github.com/apache/iceberg/pull/9830. I hope it helps us reach
convergence faster. More details about some of the design options are
discussed in the description of the PR.

Thanks,
Walaa.


On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:

> I mean separate table and view metadata that is somehow combined through a
> commit process. For instance, keeping a pointer to a table metadata file in
> a view metadata file or combining commits to reference both. I don't see
> the value in either option.
>
> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Thanks, Ryan, for helping trace this back to the root question! Just a
>> clarification question about your reply before I respond further: what
>> exactly does the option "a combination of the two (i.e. commits are
>> combined)" mean? How is that different from "a new metadata type"?
>>
>> -Jack
>>
>>
>>
>>
>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I’m catching up on this conversation, so hopefully I can bring a fresh
>>> perspective.
>>>
>>> Jack already pointed out that we need to start from the basics and I
>>> agree with that. Let’s remove voting at this point. Right now is the time
>>> for discussing trade-offs, not lining up and taking sides. I realize that
>>> wasn’t the intent with adding a vote, but that’s almost always the result.
>>> It’s too easy to use it as a stand-in for consensus and move on
>>> prematurely. I get the impression from the swirl in Slack that discussion
>>> has moved ahead of agreement.
>>>
>>> We’re still at the most basic question: is a materialized view a view
>>> and a separate table, a combination of the two (i.e. commits are combined),
>>> or a new metadata type?
>>>
>>> For now, I’m ignoring whether the “separate table” is some kind of
>>> “system table” (meaning hidden?) or if it is exposed in the catalog. That’s
>>> a later choice (already pointed out) and, I suspect, it should be delegated
>>> to catalog implementations.
>>>
>>> To simplify this a little, I think that we can eliminate the option to
>>> combine table and view commits. I don’t think there is a reason to combine
>>> the two. If separate, a table would track the view version used along with
>>> freshness information for referenced tables. If the table is automatically
>>> skipped when the version no longer matches the view, then no action needs
>>> to happen when a view definition changes. Similarly, the table can be
>>> updated independently without needing to also swap view metadata. This also
>>> aligns with the idea from the original doc that there can be multiple
>>> materialization tables for a view. Each should operate independently unless
>>> I’m missing something.
>>>
>>> I don’t think the last paragraph’s conclusion is contentious so I’ll
>>> move on, but please stop here and reply if you disagree!
>>>
>>> That leaves the two main options: a view and a separate table linked by
>>> metadata, or combined materialized view metadata.
>>>
>>> As the doc notes, the separate view and table option is simpler because
>>> it reuses existing metadata definitions and falls back to simple views.
>>> That is a significantly smaller spec and small is very, very important when
>>> it comes to specs. I think that the argument for a new definition of a
>>> materialized view needs to overcome this disadvantage.
>>>
>>> The arguments that I see for a combined materialized view object are:
>>>
>>>    - Regular views are separate, rather than being tables with SQL and
>>>    no data, so it would be inconsistent (“Iceberg view is just a table
>>>    with no data but with representations defined. But we did not do
>>>    that.”)
>>>    - Materialized views are different objects in DDL
>>>    - Tables may be a superset of functionality needed for materialized
>>>    views
>>>    - Tables are not typically exposed to end users — but this isn’t
>>>    required by the separate view and table option
>>>
>>> Am I missing any arguments for combined metadata?
>>>
>>> Ryan
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
