Re: Materialized view integration with REST spec

Szehon Ho Thu, 22 Feb 2024 11:13:52 -0800

Hi Jan

I agree with Walaa, I think the new Question should be narrow (View = View
+ Materialization, or new MV metadata), with 3 options (Materialization can
be metadata.json or nested object).


We can mention that with the former, we have another decision whether to
register it (and then refer to Question 3 already discussed in the
document).  Otherwise we will have n^2 options here and it's hard to
understand.

What do you think?
Thanks
Szehon

On Thu, Feb 22, 2024 at 1:52 AM Jan Kaul <[email protected]>
wrote:

> My motivation for the current table is to answer the question:
>
> *Do we use a View + a Storage Table or do we define a new MV metadata
> format? *To be able to provide meaningful arguments about the View +
> Storage Table option, I split it into multiple options. Otherwise arguments
> would always need to include an additional condition like:
>
> The downside of the View + Storage Table design is that two entities have
> to be registered in the catalog, if the storage table metadata is not
> stored as a JSON file or as an internal field.
>
> We can come back to the more granular questions once the aforementioned
> question is answered.
> On 22.02.24 06:04, Walaa Eldin Moustafa wrote:
>
> Thanks Jack! I feel Question 0 is very broad, essentially capturing the
> whole design. Can we start by discussing more granular questions?
>
> On Wed, Feb 21, 2024 at 8:53 PM Jack Ye <[email protected]> wrote:
>
>> Thanks everyone for the help in organizing the thoughts!
>>
>> I have moved the summary of everyone's comments here also to the doc that
>> Jan linked under question 0. We can continue to have more discussions there
>> and cast votes!
>>
>> Best,
>> Jack Ye
>>
>> On Wed, Feb 21, 2024 at 12:14 PM Jan Kaul <[email protected]>
>> <[email protected]> wrote:
>>
>>> Thanks Micah, I think the voting chips are great.
>>>
>>> @Szehon, actually what I had in mind was not to have one thread per
>>> question but rather have smaller threads that can be resolved more easily.
>>> I have the fear that one thread for the current question would lead to a
>>> very long and unmanageable discussion.
>>>
>>> I've added another row to the table where everyone could provide a
>>> summary of their reason for choosing a certain design. This way we could
>>> move some of the content from the comment threads to the main document.
>>> On 21.02.24 19:58, Micah Kornfield wrote:
>>>
>>> Of course we also need threads that express our preferences (voting). I
>>>> would suggest to keep these separate from discussions about single points
>>>> so that they can be persisted in the document.
>>>
>>>
>>> Not sure if it is helpful, but I added voting chips to Question 0, as maybe an
>>> easier way to keep track of votes.  If it is helpful, I can add them in
>>> other places that still need a vote (I think one needs a paid Google Docs
>>> account to insert them).
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Wed, Feb 21, 2024 at 10:23 AM Szehon Ho <[email protected]>
>>> wrote:
>>>
>>>> Thanks Jan.  +1 on having just one thread per question for
>>>> vote/preference.  Where do you suggest we have it, on the discussion
>>>> question itself?  It would be to keep the existing threads and move it
>>>> there.
>>>>
>>>> Also, I think it makes sense to create a Slack channel (for quick
>>>> questions and replies), and also to discuss unresolved questions in next
>>>> week's sync or a separate meeting.
>>>>
>>>> On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul <[email protected]>
>>>> <[email protected]> wrote:
>>>>
>>>>> Thank you Jack for driving the consensus for the MV spec and thank you
>>>>> all for the discussion.
>>>>>
>>>>> I really like the idea about incremental consensus because we often
>>>>> lose sight in detailed discussions. As Jack mentioned, the highest
>>>>> priority question currently is:
>>>>>
>>>>> *Should the Iceberg MV be realized as a view + storage table or do we
>>>>> define a new metadata format? *To have one place for the discussion,
>>>>> I added another Question (
>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)
>>>>> to the Materialized View Spec Google document.
>>>>>
>>>>> To improve the visibility of the arguments I would like to propose a
>>>>> new process. It would be great if all relevant information is stored in
>>>>> the document itself. Therefore I would suggest to use the comment threads
>>>>> for smaller, temporary discussions which can be resolved by adding the
>>>>> points to the main document. Please close the threads if the information
>>>>> was added to the document. Additionally, I gave you all permissions to
>>>>> edit the documents, so you can add missing points yourselves.
>>>>>
>>>>> Of course we also need threads that express our preferences (voting).
>>>>> I would suggest to keep these separate from discussions about single
>>>>> points so that they can be persisted in the document.
>>>>>
>>>>> After a phase of collecting arguments for the different designs I
>>>>> think it would make sense to have a video call for a face-to-face
>>>>> discussion.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Jan
>>>>> On 20.02.24 21:32, Manish Malhotra wrote:
>>>>>
>>>>> Very excited for MV to be in Iceberg :)
>>>>> Keeping in the same doc. would be helpful, to have the trail.
>>>>> But also agreed, if there are too many directions/threads, then keep
>>>>> closing the old one, if there are no more questions.
>>>>> And put down the assumptions for the initial version to move forward.
>>>>>
>>>>>
>>>>> On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I would vote to keep a log in the doc with open questions, and keep
>>>>>> the doc updated with open questions as they arise/get resolved.
>>>>>>
>>>>>> On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the response from everyone!
>>>>>>>
>>>>>>> Before proceeding further, I see a few people referring back to the
>>>>>>> current design from Jan. I specifically raised this thread based on the
>>>>>>> information in the doc and a few of the latest discussions we had there.
>>>>>>> Because there are many threads in the doc, and each thread points further
>>>>>>> to other discussion threads in the same doc or other docs, it is now quite
>>>>>>> hard to follow and continue discussing all the different topics there.
>>>>>>>
>>>>>>> I hope we can reach incremental consensus on the questions in the doc
>>>>>>> through the devlist, because it provides more visibility, and also a single
>>>>>>> thread instead of multiple threads going on at the same time. If we think
>>>>>>> this format is not effective, I propose that we create a new mv channel in
>>>>>>> the Iceberg Slack workspace, and people interested can join and discuss all
>>>>>>> these points directly. What do we think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Great to see more discussion on the MV spec.  Actually, Jan's
>>>>>>>> document "Iceberg Materialized View Spec"
>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
>>>>>>>> has been organized, with a "Design Questions" section to track these
>>>>>>>> debates, and it would be nice to centralize the debates there, as Micah
>>>>>>>> mentions.
>>>>>>>>
>>>>>>>> For Dan's question, I think this debate was tracked in "Design Question
>>>>>>>> 3: Should the storage table be registered in the catalog?". I think the
>>>>>>>> general idea there was to not expose it directly via the Catalog, as it
>>>>>>>> would then be exposed to user modification. If the engine wants to access
>>>>>>>> anything about the storage table (including audit and storage), it is of
>>>>>>>> course there via the storage table pointer. I think Walaa's point is also
>>>>>>>> good: we could expose it as we expose metadata tables, but I am still not
>>>>>>>> sure whether there are still some use cases of engine access not covered.
>>>>>>>>
>>>>>>>> It is true that for Jack's initial question (Do we really want to
>>>>>>>> go with the MV = view + storage table design approach for Iceberg MV?),
>>>>>>>> unfortunately we did not capture it as a "Design Question" in Jan's doc,
>>>>>>>> as it was an implicit assumption of 'yes', because it is the choice of
>>>>>>>> Hive, Trino, and other engines, as others have pointed out.
>>>>>>>>
>>>>>>>> Jack's point about potential evolution of MV (like to add
>>>>>>>> partitioning) is an interesting one, but definitely hard to grasp.  I
>>>>>>>> think it makes sense to add this as a separate Design Question in the
>>>>>>>> doc, and add the options.  This will allow us to flesh out this
>>>>>>>> alternative option(s).  Maybe Micah's point about modifying the existing
>>>>>>>> proposal to 'embed' the required table metadata fields in the existing
>>>>>>>> view metadata is one middle-ground option.  Or we add a totally new MV
>>>>>>>> object spec, separate from the existing View spec?
>>>>>>>>
>>>>>>>> Also, as Jack pointed out, it may make sense to have the REST /
>>>>>>>> Catalog API proposal in the doc to inform the above decision.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think it would help if we answer the question of whether an MV
>>>>>>>>> is a view + storage table (and degree of exposing this underlying
>>>>>>>>> implementation) in the context of the user interfacing with those
>>>>>>>>> concepts:
>>>>>>>>>
>>>>>>>>> For the end user, interfacing with the engine APIs (e.g., through
>>>>>>>>> SQL), materialized view APIs should be almost the same as regular view
>>>>>>>>> APIs (except for operations specific to materialized views like the
>>>>>>>>> REFRESH command etc). Typically, the end user interacts with the
>>>>>>>>> (materialized) view object as a view, and the engine performs the
>>>>>>>>> abstraction over the storage table.
>>>>>>>>>
>>>>>>>>> For the engines interfacing with Iceberg, it sounds like the correct
>>>>>>>>> abstraction at this layer is indeed view + storage table, and engines
>>>>>>>>> could have access to both objects to optimize queries.
>>>>>>>>>
>>>>>>>>> So in a sense, the engine will ultimately hide most of the
>>>>>>>>> storage detail from the end user (except for advanced users who want to
>>>>>>>>> explicitly access the storage table with a modifier like
>>>>>>>>> "db.view.storageTable" -- and they can only read it), while Iceberg will
>>>>>>>>> expose the storage details to the engine catalog to use it in scans if
>>>>>>>>> needed. So the storage table is hidden or exposed based on the
>>>>>>>>> context/the actual users. From Iceberg's point of view (which interacts
>>>>>>>>> with the engines), the storage table is exposed. Note that this does not
>>>>>>>>> necessarily mean that the storage table is registered in the catalog with
>>>>>>>>> its own independent name (e.g., where we can drop the view but keep the
>>>>>>>>> storage table and access it from the catalog). Addressing the storage
>>>>>>>>> table using a virtual namespace like "db.view.storageTable" sounds like a
>>>>>>>>> good middle ground. Anyway, end users should not need to directly access
>>>>>>>>> the storage table in most cases.
>>>>>>>>>
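As a rough illustration of the virtual-namespace idea above, here is a minimal Python sketch. Only the "db.view.storageTable" form comes from the text; the `resolve` function, the `views` mapping, and its `sql` and `storage_table` keys are hypothetical names chosen for this example and are not part of any Iceberg API.

```python
STORAGE_SUFFIX = "storageTable"  # modifier taken from the example in the text


def resolve(identifier, views):
    """Resolve 'db.view' to a view, or 'db.view.storageTable' to its storage table.

    `views` is a hypothetical in-memory catalog: it maps 'db.view' to a dict
    with the assumed keys 'sql' and 'storage_table'.
    """
    parts = identifier.split(".")
    if len(parts) == 3 and parts[2] == STORAGE_SUFFIX:
        view = views[".".join(parts[:2])]
        # The storage table has no independent catalog name: it is reachable
        # only through its view, and only for reads.
        return {"kind": "storage-table", "read_only": True,
                "table": view["storage_table"]}
    if len(parts) == 2:
        return {"kind": "view", "read_only": False,
                "sql": views[identifier]["sql"]}
    raise KeyError(identifier)


views = {"db.view": {"sql": "SELECT 1", "storage_table": {"snapshots": []}}}
storage = resolve("db.view.storageTable", views)
```

In this sketch, dropping the "db.view" entry also makes the storage table unreachable, which matches the point that it is not registered under its own independent name.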
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jack,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> In my mind, the first key point we all need to agree upon to
>>>>>>>>>>> move this design forward is*: Do we really want to go with the
>>>>>>>>>>> MV = view + storage table design approach for Iceberg MV?*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think we want this to the extent that we do not want to
>>>>>>>>>> redefine the same concept with different representations/naming to the
>>>>>>>>>> greatest degree possible.  This is why borrowing the concepts from the
>>>>>>>>>> view (e.g. multiple ways of expressing the same view logic in different
>>>>>>>>>> dialects) and aspects of the materialized data (e.g. partitioning,
>>>>>>>>>> ordering) feels most natural.  IIUC your proposal, I think you are
>>>>>>>>>> suggesting maybe two modifications to the existing proposals in the
>>>>>>>>>> document:
>>>>>>>>>>
>>>>>>>>>> 1.  No separate storage table link, instead embed most of the
>>>>>>>>>> metadata of the materialized table into the MV document (the exception
>>>>>>>>>> seems to be snapshot history)
>>>>>>>>>> 2.  For snapshot history, have one unified history specific to
>>>>>>>>>> the MV.
>>>>>>>>>>
>>>>>>>>>> This seems fairly reasonable to me and I think it can solve some
>>>>>>>>>> challenges with the existing proposal in an elegant way.  If this is
>>>>>>>>>> correct (or maybe if it isn't quite correct) perhaps you can make
>>>>>>>>>> suggestions to the document so all of the trade-offs can be discussed
>>>>>>>>>> in one place?
>>>>>>>>>>
>>>>>>>>>> I think the one thing the current draft of the materialized view
>>>>>>>>>> ignores is how to store algebraic summaries (e.g. separate sum and count
>>>>>>>>>> for averages, or other sketches), so that new data can be incrementally
>>>>>>>>>> incorporated.  But representing these structures feels like it
>>>>>>>>>> potentially has value beyond just MVs (e.g. it can be a natural way to
>>>>>>>>>> express summary statistics in table metadata), so I think it deserves at
>>>>>>>>>> least a try in incorporating the concepts in the table specification, so
>>>>>>>>>> the definitions can be shared.  I was imagining this could come as part
>>>>>>>>>> of the next revision of MV specification.
>>>>>>>>>>
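To make the "separate sum and count for averages" point concrete, here is a minimal Python sketch of an algebraic summary. The `AvgSummary` class and its method names are hypothetical and are not taken from any Iceberg spec or proposal; the sketch only shows why storing sum and count, rather than the average itself, lets new data be folded in without rescanning old data.

```python
from dataclasses import dataclass


@dataclass
class AvgSummary:
    """Algebraic summary for AVG: keep sum and count, not the average itself."""
    total: float = 0.0
    count: int = 0

    def add_batch(self, values):
        """Fold a batch of newly arrived values into the summary."""
        for v in values:
            self.total += v
            self.count += 1

    def merge(self, other):
        """Combine two summaries, e.g. from separate incremental refreshes."""
        return AvgSummary(self.total + other.total, self.count + other.count)

    def average(self):
        """Derive the average on read; None when no rows have been seen."""
        return self.total / self.count if self.count else None


summary = AvgSummary()
summary.add_batch([10.0, 20.0])  # initial materialization
summary.add_batch([60.0])        # incremental refresh: only the new rows are read
```

A stored average alone could not be updated this way, because the number of rows behind it would be unknown; that is the sense in which the extra state has to live somewhere in the materialized data or its metadata.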
>>>>>>>>>> The MV internal structure could evolve in a way that works more
>>>>>>>>>>> efficiently with the reduced scope of functionalities, without 
>>>>>>>>>>> relying on
>>>>>>>>>>> table to offer the same capabilities. I can at least say that is 
>>>>>>>>>>> true based
>>>>>>>>>>> on my internal knowledge of how Redshift MVs work.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm not sure I fully understand this point, but it seems the main
>>>>>>>>>> question here is what would break if it started to evolve in this
>>>>>>>>>> direction.  Is it purely additive or do we suspect some elements would
>>>>>>>>>> need to be removed?  My gut feeling is that the main concern here is
>>>>>>>>>> getting the cardinalities correct (i.e. 1 MV should probably have 0, 1 or
>>>>>>>>>> more materialized storage tables associated with it, to support more
>>>>>>>>>> advanced algebraic structures listed above, and perhaps a second without
>>>>>>>>>> them, and additional metadata to distinguish between these two different
>>>>>>>>>> modes).
>>>>>>>>>>
>>>>>>>>>> If after the evaluation, we are confident that the MV = view +
>>>>>>>>>>> storage table approach is the right way to go, then we can debate 
>>>>>>>>>>> the other
>>>>>>>>>>> issues, and I think the next issue to reach consensus should be 
>>>>>>>>>>> "Should the
>>>>>>>>>>> storage table be registered in the catalog?".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I actually think there are more fundamental questions
>>>>>>>>>> posed:
>>>>>>>>>> 1.  Should we be considering how items should be modelled in the
>>>>>>>>>> REST API concurrently with the Iceberg spec, as that potentially impacts
>>>>>>>>>> design decisions (I think the answer is yes, and we should update the doc
>>>>>>>>>> with sketches on new endpoints and operations on the endpoints to ensure
>>>>>>>>>> things align).
>>>>>>>>>> 2.  Going forward should new aspects of Iceberg artifacts rely on
>>>>>>>>>> the fact that a catalog is present and we can rely on a naming convention
>>>>>>>>>> for looking up other artifacts in a catalog as pointers (I lean yes on
>>>>>>>>>> this, but I'm a little bit more ambivalent).
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 19, 2024 at 12:52 PM Jack Ye <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I suggest we need a step-by-step process to reach incremental
>>>>>>>>>>> consensus, otherwise we are constantly talking about many different
>>>>>>>>>>> debates at the same time.
>>>>>>>>>>>
>>>>>>>>>>> In my mind, the first key point we all need to agree upon to
>>>>>>>>>>> move this design forward is*: Do we really want to go with the
>>>>>>>>>>> MV = view + storage table design approach for Iceberg MV?*
>>>>>>>>>>>
>>>>>>>>>>> I think we (at least me) started with this assumption, mostly
>>>>>>>>>>> because this is how Trino implements MV, and how Hive tables store 
>>>>>>>>>>> MV
>>>>>>>>>>> information today. But does it mean we should design it that way in 
>>>>>>>>>>> Iceberg?
>>>>>>>>>>>
>>>>>>>>>>> Now that I look back at how we did the view spec design, we could
>>>>>>>>>>> also have said that we just add a representation field in the table spec
>>>>>>>>>>> to store the view, and an Iceberg view is just a table with no data but
>>>>>>>>>>> with representations defined. But we did not do that. So it now feels
>>>>>>>>>>> quite inconsistent to say we want to just add a few fields in the table
>>>>>>>>>>> and view spec to call it an Iceberg MV.
>>>>>>>>>>>
>>>>>>>>>>> If we look into most of the other database systems (e.g.
>>>>>>>>>>> Redshift, BigQuery, Snowflake), they never expose implementation
>>>>>>>>>>> details such as the storage table. Apart from being closed-source
>>>>>>>>>>> systems, I think it is also for good technical reasons. There are many
>>>>>>>>>>> more things that a table needs to support, but that do not really apply
>>>>>>>>>>> to MV. The MV internal structure could evolve in a way that works more
>>>>>>>>>>> efficiently with the reduced scope of functionalities, without relying
>>>>>>>>>>> on table to offer the same capabilities. I can at least say that is true
>>>>>>>>>>> based on my internal knowledge of how Redshift MVs work.
>>>>>>>>>>>
>>>>>>>>>>> I think we should fully evaluate both directions, and commit to
>>>>>>>>>>> one first before debating more things.
>>>>>>>>>>>
>>>>>>>>>>> If we have a new and independent Iceberg MV spec, then an
>>>>>>>>>>> Iceberg MV is under-the-hood a single object containing all MV 
>>>>>>>>>>> information.
>>>>>>>>>>> It has its own name, snapshots, view representation, etc. I don't 
>>>>>>>>>>> believe
>>>>>>>>>>> we will be blocked by Trino due to its MV SPIs currently requiring 
>>>>>>>>>>> the
>>>>>>>>>>> existence of a storage table, as it will just be a different 
>>>>>>>>>>> implementation
>>>>>>>>>>> from the existing one in Trino-Iceberg. In this direction, I don't 
>>>>>>>>>>> think we
>>>>>>>>>>> need to have any further debate about pointers, metadata locations, 
>>>>>>>>>>> storage
>>>>>>>>>>> table, etc. because everything will be new.
>>>>>>>>>>>
>>>>>>>>>>> If after the evaluation, we are confident that the MV = view +
>>>>>>>>>>> storage table approach is the right way to go, then we can debate 
>>>>>>>>>>> the other
>>>>>>>>>>> issues, and I think the next issue to reach consensus should be 
>>>>>>>>>>> "Should the
>>>>>>>>>>> storage table be registered in the catalog?".
>>>>>>>>>>>
>>>>>>>>>>> What do we think?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Jack,
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should consider either allowing the storage table to
>>>>>>>>>>>> be fully exposed/addressable via the catalog or allow access via
>>>>>>>>>>>> namespacing like with metadata tables.  E.g.
>>>>>>>>>>>> <catalog>.<database>.<table>.<storage>, which would allow for full 
>>>>>>>>>>>> access
>>>>>>>>>>>> to the underlying table.
>>>>>>>>>>>>
>>>>>>>>>>>> For other engines to interact with the storage table (e.g. to
>>>>>>>>>>>> execute the query to materialize the table), it may be necessary 
>>>>>>>>>>>> that the
>>>>>>>>>>>> table is fully addressable.  Whether the storage table is returned 
>>>>>>>>>>>> as part
>>>>>>>>>>>> of list operations may be something we leave up to the catalog
>>>>>>>>>>>> implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think the table should reference a physical location
>>>>>>>>>>>> (only a logical reference) since things will be changing behind the view
>>>>>>>>>>>> definition and I'm not confident we want to have to update the view
>>>>>>>>>>>> representation every time the storage table is updated.
>>>>>>>>>>>>
>>>>>>>>>>>> I think there's still some exploration as to whether we need to
>>>>>>>>>>>> model this as separate from view endpoints, but there may be 
>>>>>>>>>>>> enough overlap
>>>>>>>>>>>> that it's not necessary to have yet another set of endpoints for
>>>>>>>>>>>> materialized views (maybe filter params if you need to 
>>>>>>>>>>>> distinguish?).
>>>>>>>>>>>>
>>>>>>>>>>>> -Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, Jack:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for raising this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>>>>>>>> independent objects, at least at API level. It is very rare for 
>>>>>>>>>>>>>> a system to
>>>>>>>>>>>>>> support operations like "materializing a logical view" or 
>>>>>>>>>>>>>> "upgrading a
>>>>>>>>>>>>>> logical view to MV", because view and MV are very different in 
>>>>>>>>>>>>>> almost every
>>>>>>>>>>>>>> aspect of user experience. Extending the existing view or table 
>>>>>>>>>>>>>> spec to
>>>>>>>>>>>>>> accommodate MV might give us a MV implementation similar to the 
>>>>>>>>>>>>>> current
>>>>>>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, 
>>>>>>>>>>>>>> but it
>>>>>>>>>>>>>> binds us to a very specific design of MV, which we might regret 
>>>>>>>>>>>>>> in the
>>>>>>>>>>>>>> future.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I reviewed the doc, I thought we were discussing the spec
>>>>>>>>>>>>> of materialized view, just like the spec of table metadata, but not
>>>>>>>>>>>>> the user facing api. I would definitely agree that we should consider
>>>>>>>>>>>>> MV as another kind of database object in user facing api, even though
>>>>>>>>>>>>> it's internally modelled as a view + storage table pointer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>>>>>>>> should at least consider directly describing the full metadata 
>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON 
>>>>>>>>>>>>>> file.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you mean we need to add components like
>>>>>>>>>>>>> `LoadMaterializedViewResponse`? If so, I would +1 this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with
>>>>>>>>>>>>>> MVs?*
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> As I have mentioned above,  I think we should consider MV as
>>>>>>>>>>>>> another database object, so I think we should add a set of apis
>>>>>>>>>>>>> specifically designed for MV, such as `loadMV`, `freshMV`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As we are discussing the spec change for materialized view,
>>>>>>>>>>>>>> there has been a missing aspect that is technically also 
>>>>>>>>>>>>>> related, and might
>>>>>>>>>>>>>> affect the MV spec design: *how do we want to add MV support
>>>>>>>>>>>>>> to the REST spec?*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to discuss this in a new thread to collect
>>>>>>>>>>>>>> people's thoughts. This topic expands to the following 2 
>>>>>>>>>>>>>> sub-questions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Q1: how would the MV spec change affect the REST spec?*
>>>>>>>>>>>>>> In the current proposal, it looks like we are using a design
>>>>>>>>>>>>>> where a MV is modeled as an Iceberg view linking to an Iceberg 
>>>>>>>>>>>>>> storage
>>>>>>>>>>>>>> table. At the same time, we do not want to expose this storage 
>>>>>>>>>>>>>> table in the
>>>>>>>>>>>>>> catalog, thus the Iceberg view has a pointer to only a metadata 
>>>>>>>>>>>>>> JSON file
>>>>>>>>>>>>>> of the Iceberg storage table. Each MV refresh updates the 
>>>>>>>>>>>>>> pointer to a new
>>>>>>>>>>>>>> metadata JSON file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I feel this does not play very well with the direction that
>>>>>>>>>>>>>> REST is going. The REST catalog is trying to remove the dependency on
>>>>>>>>>>>>>> the metadata JSON file. For example, in LoadTableResponse the only
>>>>>>>>>>>>>> required field is the metadata, and metadata-location is actually
>>>>>>>>>>>>>> optional.
>>>>>>>>>>>>>>
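The required/optional split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `LoadTableResponseSketch` and `parse_load_table_response` are hypothetical names and not the actual REST model or client code. Only the two field names, `metadata` and `metadata-location`, come from the text.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LoadTableResponseSketch:
    """Hypothetical stand-in for a load-table response, not the real model."""
    metadata: Dict[str, Any]                 # required: the table metadata itself
    metadata_location: Optional[str] = None  # optional: pointer to a metadata JSON file


def parse_load_table_response(payload):
    """Accept a payload without 'metadata-location', but never without 'metadata'."""
    if "metadata" not in payload:
        raise ValueError("response must carry 'metadata'")
    return LoadTableResponseSketch(
        metadata=payload["metadata"],
        metadata_location=payload.get("metadata-location"),
    )


# A client can work from the inline metadata alone, with no JSON file pointer.
resp = parse_load_table_response({"metadata": {"format-version": 2}})
```

A view that only stores a pointer to a storage-table metadata JSON file would depend on exactly the field that this shape treats as optional.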
>>>>>>>>>>>>>> If we want to make the REST experience good for MV, I think
>>>>>>>>>>>>>> we should at least consider directly describing the full 
>>>>>>>>>>>>>> metadata of the
>>>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON 
>>>>>>>>>>>>>> file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with
>>>>>>>>>>>>>> MVs?*
>>>>>>>>>>>>>> So far we have been thinking about amending the view spec to
>>>>>>>>>>>>>> accommodate MV. This entails likely having MVs also being 
>>>>>>>>>>>>>> handled through
>>>>>>>>>>>>>> the view APIs in REST spec.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We need to agree with that first in the community, because
>>>>>>>>>>>>>> this has various implications, and I am not really sure at this 
>>>>>>>>>>>>>> point if it
>>>>>>>>>>>>>> is the best way to go.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If MV interactions are through the view APIs, the view APIs
>>>>>>>>>>>>>> need to be updated to accommodate MV constructs that are not really
>>>>>>>>>>>>>> related to logical views. In fact, most actions performed on MVs are more
>>>>>>>>>>>>>> similar to actions performed on a table than on a view, which involve
>>>>>>>>>>>>>> configuring data layout, read and write constructs. For example, users
>>>>>>>>>>>>>> might run something like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CREATE MATERIALIZED VIEW mv
>>>>>>>>>>>>>> PARTITION BY col1
>>>>>>>>>>>>>> CLUSTER BY col2
>>>>>>>>>>>>>> AS ( /* some SQL */ )
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> then the CreateView API needs to accept a partition spec and
>>>>>>>>>>>>>> sort order that are not relevant at all to logical views.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When reading a MV, we might even want to have a
>>>>>>>>>>>>>> PlanMaterializedView API similar to the PlanTable API we are 
>>>>>>>>>>>>>> adding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *My personal take*
>>>>>>>>>>>>>> It feels like we need to reconsider the question of what is
>>>>>>>>>>>>>> the best way to model MV in Iceberg. Should it be (1) a view 
>>>>>>>>>>>>>> linked to a
>>>>>>>>>>>>>> storage table, or (2) a table with a view SQL associated with 
>>>>>>>>>>>>>> it, or (3)
>>>>>>>>>>>>>> it's a completely independent thing. This topic was discussed in 
>>>>>>>>>>>>>> the past in
>>>>>>>>>>>>>> this doc
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
>>>>>>>>>>>>>> but at that time we did not have much perspective about aspects 
>>>>>>>>>>>>>> like REST
>>>>>>>>>>>>>> spec, and the view integration was also not fully completed yet. 
>>>>>>>>>>>>>> With the
>>>>>>>>>>>>>> new knowledge, currently I am actually leaning a bit more 
>>>>>>>>>>>>>> towards (3).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>>>>>>>> independent objects, at least at API level. It is very rare for 
>>>>>>>>>>>>>> a system to
>>>>>>>>>>>>>> support operations like "materializing a logical view" or 
>>>>>>>>>>>>>> "upgrading a
>>>>>>>>>>>>>> logical view to MV", because view and MV are very different in 
>>>>>>>>>>>>>> almost every
>>>>>>>>>>>>>> aspect of user experience. Extending the existing view or table 
>>>>>>>>>>>>>> spec to
>>>>>>>>>>>>>> accommodate MV might give us a MV implementation similar to the 
>>>>>>>>>>>>>> current
>>>>>>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, 
>>>>>>>>>>>>>> but it
>>>>>>>>>>>>>> binds us to a very specific design of MV, which we might regret 
>>>>>>>>>>>>>> in the
>>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we make a new MV spec, it can be made up of fields that
>>>>>>>>>>>>>> already exist in the table and view specs, but it is a whole new 
>>>>>>>>>>>>>> spec. In
>>>>>>>>>>>>>> this way, the spec can evolve independently to accommodate MV 
>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>> features, and we can also create MV-related REST endpoints that 
>>>>>>>>>>>>>> will evolve
>>>>>>>>>>>>>> independently from table and view REST APIs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But on the other side it is definitely associated with more
>>>>>>>>>>>>>> work to maintain a new spec, and potentially big refactoring in 
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> codebase to make sure operations today that work on table or 
>>>>>>>>>>>>>> view can now
>>>>>>>>>>>>>> support MV as a different object. And it definitely has other 
>>>>>>>>>>>>>> problems that
>>>>>>>>>>>>>> I have overlooked. I would greatly appreciate any thoughts about 
>>>>>>>>>>>>>> this!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
