To address the specific question about MV = view + storage, I do feel that
is the right approach.

(The alternative would actually fit more cleanly with the "materialized
table" concept, but there are a lot of reasons that probably isn't a great
path to go down.)

In many ways the materialized view is an extension/optimization of a view.
I understand your point that other databases (like redshift, bigquery,
snowflake) don't expose the materialized storage, however that doesn't
really apply in the same way because those are vertical solutions and can
more easily hide the internal representation.  However, not only do I think
it makes sense to expose the storage, I think it is necessary and provides
a lot of capability.

You want to be able to have multiple engines potentially participate and
have access to the underlying storage because they may have different
abilities to refresh or consume that data.  Additionally, it's important to
have a way to audit and inspect the storage table and how it changes over
time.

If you're proposing that an engine like redshift could have their own
optimized internal representation of the view, I think that's entirely
reasonable as long as it adheres to the required semantics.  Other engines
would just not see that representation and quite likely would be tracked
outside the Iceberg metadata.  However, that's more of a redshift
materialized view at that point since only redshift would really be able to
leverage the underlying storage.

I think it makes sense to have a single spec for both view and materialized
view as there is a significant overlap in the definition and behaviors.  In
fact, a materialized view is a superset of view (depending on defined
behaviors).  I think it overcomplicates things to separate the two.

I'm open to considering other approaches, but I haven't really seen a
compelling reason to take a different path than what Trino and other
engines have paved.
-Dan





On Mon, Feb 19, 2024 at 12:53 PM Jack Ye <yezhao...@gmail.com> wrote:

> I suggest we need a step-by-step process to make incremental consensus,
> otherwise we are constantly talking about many different debates at the
> same time.
>
> In my mind, the first key point we all need to agree upon to move this
> design forward is*: Do we really want to go with the MV = view + storage
> table design approach for Iceberg MV?*
>
> I think we (at least me) started with this assumption, mostly because this
> is how Trino implements MV, and how Hive tables store MV information today.
> But does it mean we should design it that way in Iceberg?
>
> Now I look back at how we did the view spec design, we could also say that
> we just add a representation field in the table spec to store view, and an
> Iceberg view is just a table with no data but with representations defined.
> But we did not do that. So it feels now quite inconsistent to say we want
> to just add a few fields in the table and view spec to call it an Iceberg
> MV.
>
> If we look into most of the other database systems (e.g. Redshift,
> BigQuery, Snowflake), they never expose such implementation details like
> storage table. Apart from being close-sourced systems, I think it is also
> for good technical reasons. There are many more things that a table needs
> to support, but does not really apply to MV. The MV internal structure
> could evolve in a way that works more efficiently with the reduced scope of
> functionalities, without relying on table to offer the same capabilities. I
> can at least say that is true based on my internal knowledge of how
> Redshift MVs work.
>
> I think we should fully evaluate both directions, and commit to one first
> before debating more things.
>
> If we have a new and independent Iceberg MV spec, then an Iceberg MV is
> under-the-hood a single object containing all MV information. It has its
> own name, snapshots, view representation, etc. I don't believe we will be
> blocked by Trino due to its MV SPIs currently requiring the existence of a
> storage table, as it will just be a different implementation from the
> existing one in Trino-Iceberg. In this direction, I don't think we need to
> have any further debate about pointers, metadata locations, storage table,
> etc. because everything will be new.
>
> If after the evaluation, we are confident that the MV = view + storage
> table approach is the right way to go, then we can debate the other issues,
> and I think the next issue to reach consensus should be "Should the storage
> table be registered in the catalog?".
>
> What do we think?
>
> -Jack
>
>
>
>
> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Jack,
>>
>> I think we should consider either allowing the storage table to be fully
>> exposed/addressable via the catalog or allow access via namespacing like
>> with metadata tables.  E.g. <catalog>.<database>.<table>.<storage>, which
>> would allow for full access to the underlying table.
>>
>> For other engines to interact with the storage table (e.g. to execute the
>> query to materialize the table), it may be necessary that the table is
>> fully addressable.  Whether the storage table is returned as part of list
>> operations may be something we leave up to the catalog implementation.
>>
>> I don't think the table should reference a physical location (only a
>> logical reference) since things will be changing behind the view definition
>> and I'm not confident we want to have to update the view representation
>> everytime the storage table is updated.
>>
>> I think there's still some exploration as to whether we need to model
>> this as separate from view endpoints, but there may be enough overlap that
>> it's not necessary to have yet another set of endpoints for materialized
>> views (maybe filter params if you need to distinguish?).
>>
>> -Dan
>>
>>
>>
>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Hi, Jack:
>>>
>>> Thanks for raising this.
>>>
>>> In most database systems, MV, view and table are considered independent
>>>> objects, at least at API level. It is very rare for a system to support
>>>> operations like "materializing a logical view" or "upgrading a logical view
>>>> to MV", because view and MV are very different in almost every aspect of
>>>> user experience. Extending the existing view or table spec to accommodate
>>>> MV might give us a MV implementation similar to the current Trino or Hive
>>>> views, save us some effort and a few APIs in REST, but it binds us to a
>>>> very specific design of MV, which we might regret in the future.
>>>
>>>
>>> When I reviewed the doc, I thought we were discussing the spec of
>>> materialized view, just like the spec of table metadata, but didn't not the
>>> user facing api. I would definitely agree that we should consider MV as
>>> another kind of database object in user facing api, even though it's
>>> internally modelled as a view + storage table pointer.
>>>
>>> If we want to make the REST experience good for MV, I think we should at
>>>> least consider directly describing the full metadata of the storage table
>>>> in Iceberg view, instead of pointing to a JSON file.
>>>
>>>
>>> Do you mean we need to add components like
>>> `LoadMaterializedViewResponse`, if so, I would +1 for this.
>>>
>>> *Q2: what REST APIs do we expect to use for interactions with MVs?*
>>>
>>>
>>> As I have mentioned above,  I think we should consider MV as another
>>> database object, so I think we should add a set of apis specifically
>>> designed for MV, such as `loadMV`, `freshMV`.
>>>
>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> As we are discussing the spec change for materialized view, there has
>>>> been a missing aspect that is technically also related, and might affect
>>>> the MV spec design: *how do we want to add MV support to the REST
>>>> spec?*
>>>>
>>>> I would like to discuss this in a new thread to collect people's
>>>> thoughts. This topic expands to the following 2 sub-questions:
>>>>
>>>> *Q1: how would the MV spec change affect the REST spec?*
>>>> In the current proposal, it looks like we are using a design where a MV
>>>> is modeled as an Iceberg view linking to an Iceberg storage table. At the
>>>> same time, we do not want to expose this storage table in the catalog, thus
>>>> the Iceberg view has a pointer to only a metadata JSON file of the Iceberg
>>>> storage table. Each MV refresh updates the pointer to a new metadata JSON
>>>> file.
>>>>
>>>> I feel this does not play very well with the direction that REST is
>>>> going. The REST catalog is trying to remove the dependency to the metadata
>>>> JSON file. For example, in LoadTableResponse the only required field is the
>>>> metadata, and metadata-location is actually optional.
>>>>
>>>> If we want to make the REST experience good for MV, I think we should
>>>> at least consider directly describing the full metadata of the storage
>>>> table in Iceberg view, instead of pointing to a JSON file.
>>>>
>>>> *Q2: what REST APIs do we expect to use for interactions with MVs?*
>>>> So far we have been thinking about amending the view spec to
>>>> accommodate MV. This entails likely having MVs also being handled through
>>>> the view APIs in REST spec.
>>>>
>>>> We need to agree with that first in the community, because this has
>>>> various implications, and I am not really sure at this point if it is the
>>>> best way to go.
>>>>
>>>> If MV interactions are through the view APIs, the view APIs need to be
>>>> updated to accommodate MV constructs that are not really related to logical
>>>> views. In fact, most actions performed on MVs are more similar to actions
>>>> performed on table rather than view, which involve configuring data layout,
>>>> read and write constructs. For example, users might run something like:
>>>>
>>>> CREATE MATERIALIZED VIEW mv
>>>> PARTITION BY col1
>>>> CLUSTER BY col2
>>>> AS ( // some sql )
>>>>
>>>> then the CreateView API needs to accept partition spec and sort order
>>>> that are completely not relevant for logical views.
>>>>
>>>> When reading a MV, we might even want to have a PlanMaterializedView
>>>> API similar to the PlanTable API we are adding.
>>>>
>>>> *My personal take*
>>>> It feels like we need to reconsider the question of what is the best
>>>> way to model MV in Iceberg. Should it be (1) a view linked to a storage
>>>> table, or (2) a table with a view SQL associated with it, or (3) it's a
>>>> completely independent thing. This topic was discussed in the past in
>>>> this doc
>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
>>>> but at that time we did not have much perspective about aspects like REST
>>>> spec, and the view integration was also not fully completed yet. With the
>>>> new knowledge, currently I am actually leaning a bit more towards (3).
>>>>
>>>> In most database systems, MV, view and table are considered independent
>>>> objects, at least at API level. It is very rare for a system to support
>>>> operations like "materializing a logical view" or "upgrading a logical view
>>>> to MV", because view and MV are very different in almost every aspect of
>>>> user experience. Extending the existing view or table spec to accommodate
>>>> MV might give us a MV implementation similar to the current Trino or Hive
>>>> views, save us some effort and a few APIs in REST, but it binds us to a
>>>> very specific design of MV, which we might regret in the future.
>>>>
>>>> If we make a new MV spec, it can be made up of fields that already
>>>> exist in the table and view specs, but it is a whole new spec. In this way,
>>>> the spec can evolve independently to accommodate MV specific features, and
>>>> we can also create MV-related REST endpoints that will evolve independently
>>>> from table and view REST APIs.
>>>>
>>>> But on the other side it is definitely associated with more work to
>>>> maintain a new spec, and potentially big refactoring in the codebase to
>>>> make sure operations today that work on table or view can now support MV as
>>>> a different object. And it definitely has other problems that I have
>>>> overlooked. I would greatly appreciate any thoughts about this!
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>

Reply via email to