Re: Materialized view integration with REST spec

Jack Ye Wed, 21 Feb 2024 20:53:02 -0800

Thanks everyone for the help in organizing the thoughts!

I have moved the summary of everyone's comments here also to the doc that
Jan linked under question 0. We can continue to have more discussions there
and cast votes!


Best,
Jack Ye

On Wed, Feb 21, 2024 at 12:14 PM Jan Kaul <[email protected]>
wrote:

> Thanks Micah, I think the voting chips are great.
>
> @Szehon, actually what I had in mind was not to have one thread per
> question but rather have smaller threads that can be resolved more easily.
> I have the fear that one thread for the current question would lead to a
> very long and unmanageable discussion.
>
> I've added another row to the table where everyone could provide a summary
> of their reason for choosing a certain design. This way we could move some
> of the content from the comment threads to the main document.
>
> On 21.02.24 19:58, Micah Kornfield wrote:
>
>> Of course we also need threads that express our preferences (voting). I
>> would suggest to keep these separate from discussions about single points
>> so that they can be persisted in the document.
>
>
> Not sure if it is helpful, but I added voting chips to Question 0, as maybe an
> easier way to keep track of votes.  If it is helpful, I can add them in
> other places that still need a vote (I think one needs a paid Google Docs
> account to insert them).
>
> Thanks,
> Micah
>
> On Wed, Feb 21, 2024 at 10:23 AM Szehon Ho <[email protected]>
> wrote:
>
>> Thanks Jan.  +1 on having just one thread per question for
>> vote/preference.  Where do you suggest we have it, on the discussion
>> question itself?  It would be to keep the existing threads and move it
>> there.
>>
>> Also, I think it makes sense to create a Slack channel (for quick
>> questions and replies), and also to discuss unresolved questions in next
>> week's sync or a separate meeting.
>>
>> On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul <[email protected]> wrote:
>>
>>> Thank you Jack for driving the consensus for the MV spec and thank you
>>> all for the discussion.
>>>
>>> I really like the idea about incremental consensus because we often
>>> lose sight of the big picture in detailed discussions. As Jack mentioned,
>>> the highest priority question currently is:
>>>
>>> *Should the Iceberg MV be realized as a view + storage table or do we
>>> define a new metadata format?*
>>>
>>> To have one place for the discussion, I added another Question (
>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)
>>> to the Materialized View Spec Google document.
>>>
>>> To improve the visibility of the arguments I would like to propose a new
>>> process. It would be great if all relevant information is stored in the
>>> document itself. Therefore I would suggest to use the comment threads for
>>> smaller, temporary discussions which can be resolved by adding the points
>>> to the main document. Please close the threads if the information was added
>>> to the document. Additionally, I gave you all permissions to edit the
>>> documents, so you can add missing points yourselves.
>>>
>>> Of course we also need threads that express our preferences (voting). I
>>> would suggest to keep these separate from discussions about single points
>>> so that they can be persisted in the document.
>>>
>>> After a phase of collecting arguments for the different designs I think
>>> it would make sense to have a video call for a face-to-face discussion.
>>>
>>> What do you think?
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>> On 20.02.24 21:32, Manish Malhotra wrote:
>>>
>>> Very excited for MV to be in Iceberg :)
>>> Keeping it in the same doc would be helpful, to have the trail.
>>> But also agreed, if there are too many directions/threads, then keep
>>> closing the old ones when there are no more questions.
>>> And put down the assumptions for the initial version to move forward.
>>>
>>>
>>> On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>>
>>>> I would vote to keep a log in the doc with open questions, and keep the
>>>> doc updated with open questions as they arise/get resolved.
>>>>
>>>> On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <[email protected]> wrote:
>>>>
>>>>> Thanks for the response from everyone!
>>>>>
>>>>> Before proceeding further, I see a few people referring back to the
>>>>> current design from Jan. I specifically raised this thread based on the
>>>>> information in the doc and a few of the latest discussions we had there.
>>>>> Because there are many threads in the doc, and each thread points further
>>>>> to other discussion threads in the same doc or in other docs, it is now
>>>>> quite hard to follow and continue discussing all the different topics there.
>>>>>
>>>>> I hope we can reach incremental consensus on the questions in the doc
>>>>> through the dev list, because it provides more visibility, and also a
>>>>> single thread instead of multiple threads going on at the same time. If we
>>>>> think this format is not effective, I propose that we create a new MV
>>>>> channel in the Iceberg Slack workspace, so that people interested can join
>>>>> and discuss all these points directly. What do we think?
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Great to see more discussion on the MV spec.  Actually, Jan's
>>>>>> document "Iceberg Materialized View Spec"
>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
>>>>>> has been organized, with a "Design Questions" section to track these
>>>>>> debates, and it would be nice to centralize the debates there, as Micah
>>>>>> mentions.
>>>>>>
>>>>>> For Dan's question, I think this debate was tracked in "Design Question
>>>>>> 3: Should the storage table be registered in the catalog?". I think the
>>>>>> general idea there was to not expose it directly via the Catalog, as it
>>>>>> would then be exposed to user modification. If the engine wants to access
>>>>>> anything about the storage table (including audit and storage), it is of
>>>>>> course there via the storage table pointer. I think Walaa's point is also
>>>>>> good, we could expose it as we expose metadata tables, but I am still not
>>>>>> sure whether there are still some use cases of engine access not covered?
>>>>>>
>>>>>> It is true that for Jack's initial question (Do we really want to go
>>>>>> with the MV = view + storage table design approach for Iceberg MV?),
>>>>>> unfortunately we did not capture it as a "Design Question" in Jan's doc,
>>>>>> as it was an implicit assumption of 'yes', because it is the choice of
>>>>>> Hive, Trino, and other engines, as others have pointed out.
>>>>>>
>>>>>> Jack's point about potential evolution of MV (like to add
>>>>>> partitioning) is an interesting one, but definitely hard to grasp.  I
>>>>>> think it makes sense to add this as a separate Design Question in the doc,
>>>>>> and add the options.  This will allow us to flesh out the alternative
>>>>>> option(s).  Maybe Micah's point about modifying the existing proposal to
>>>>>> 'embed' the required table metadata fields in the existing view metadata
>>>>>> is one middle-ground option.  Or we add a totally new object spec for MV,
>>>>>> separate from the existing View spec?
>>>>>>
>>>>>> Also, as Jack pointed out, it may make sense to have the REST /
>>>>>> Catalog API proposal in the doc to inform the above decision.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think it would help if we answer the question of whether an MV is
>>>>>>> a view + storage table (and the degree of exposing this underlying
>>>>>>> implementation) in the context of the user interfacing with those
>>>>>>> concepts:
>>>>>>>
>>>>>>> For the end user, interfacing with the engine APIs (e.g., through
>>>>>>> SQL), materialized view APIs should be almost the same as regular view
>>>>>>> APIs (except for operations specific to materialized views, like the
>>>>>>> REFRESH command). Typically, the end user interacts with the
>>>>>>> (materialized) view object as a view, and the engine performs the
>>>>>>> abstraction over the storage table.
>>>>>>>
>>>>>>> For the engines interfacing with Iceberg, it sounds like the correct
>>>>>>> abstraction at this layer is indeed view + storage table, and engines
>>>>>>> could have access to both objects to optimize queries.
>>>>>>>
>>>>>>> So in a sense, the engine will ultimately hide most of the
>>>>>>> storage detail from the end user (except for advanced users who want to
>>>>>>> explicitly access the storage table with a modifier like
>>>>>>> "db.view.storageTable" -- and they can only read it), while Iceberg will
>>>>>>> expose the storage details to the engine catalog to use it in scans if
>>>>>>> needed. So the storage table is hidden or exposed based on the
>>>>>>> context/the actual users. From the Iceberg point of view (which interacts
>>>>>>> with the engines), the storage table is exposed. Note that this does not
>>>>>>> necessarily mean that the storage table is registered in the catalog
>>>>>>> with its own independent name (e.g., where we can drop the view but keep
>>>>>>> the storage table and access it from the catalog). Addressing the storage
>>>>>>> table using a virtual namespace like "db.view.storageTable" sounds like a
>>>>>>> good middle ground. Anyway, end users should not need to directly access
>>>>>>> the storage table in most cases.
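
A minimal sketch of the virtual-namespace idea above, assuming a catalog front end that recognizes a trailing "storageTable" segment and maps it to the view's storage table instead of looking it up as an independently registered table. The suffix, the `ResolvedName` type, and the `resolve` function are hypothetical names for illustration only; nothing here is part of the Iceberg spec or any existing catalog API.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical suffix; the thread uses "storageTable" only as an example.
STORAGE_SUFFIX = "storageTable"


@dataclass(frozen=True)
class ResolvedName:
    namespace: Tuple[str, ...]  # e.g. ("db",)
    name: str                   # the view (or table) name
    storage_table: bool         # True if the MV's storage table was requested


def resolve(identifier: str) -> ResolvedName:
    """Split a dotted identifier and detect the virtual storage-table suffix."""
    parts = identifier.split(".")
    if len(parts) >= 3 and parts[-1] == STORAGE_SUFFIX:
        # "db.view.storageTable" -> storage table of view "db.view"
        return ResolvedName(tuple(parts[:-2]), parts[-2], True)
    if len(parts) < 2:
        raise ValueError(f"expected <namespace>.<name>, got {identifier!r}")
    return ResolvedName(tuple(parts[:-1]), parts[-1], False)
```

One trade-off this sketch makes visible: a real table literally named "storageTable" inside a namespace "db.view" would be shadowed, so a real design would need a reserved name or an explicit escape rule.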
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Jack,
>>>>>>>>
>>>>>>>>
>>>>>>>>> In my mind, the first key point we all need to agree upon to move
>>>>>>>>> this design forward is*: Do we really want to go with the MV =
>>>>>>>>> view + storage table design approach for Iceberg MV?*
>>>>>>>>
>>>>>>>>
>>>>>>>> I think we want this to the extent that we do not want to redefine
>>>>>>>> the same concept with different representations/naming to the greatest
>>>>>>>> degree possible.  This is why borrowing the concepts from the view
>>>>>>>> (e.g. multiple ways of expressing the same view logic in different
>>>>>>>> dialects) and aspects of the materialized data (e.g. partitioning,
>>>>>>>> ordering) feels most natural.  IIUC your proposal, I think you are
>>>>>>>> suggesting maybe two modifications to the existing proposals in the
>>>>>>>> document:
>>>>>>>>
>>>>>>>> 1.  No separate storage table link, instead embed most of the
>>>>>>>> metadata of the materialized table into the MV document (the exception
>>>>>>>> seems to be snapshot history)
>>>>>>>> 2.  For snapshot history, have one unified history specific to the
>>>>>>>> MV.
>>>>>>>>
>>>>>>>> This seems fairly reasonable to me and I think it can solve some
>>>>>>>> challenges with the existing proposal in an elegant way.  If this is
>>>>>>>> correct (or maybe if it isn't quite correct) perhaps you can make
>>>>>>>> suggestions to the document so all of the trade-offs can be discussed
>>>>>>>> in one place?
>>>>>>>>
>>>>>>>> I think the one thing the current draft of the materialized view spec
>>>>>>>> ignores is how to store algebraic summaries (e.g. separate sum and
>>>>>>>> count for averages, or other sketches), so that new data can be
>>>>>>>> incrementally incorporated.  But representing these structures feels like
>>>>>>>> it potentially has value beyond just MVs (e.g. it can be a natural way to
>>>>>>>> express summary statistics in table metadata), so I think it deserves at
>>>>>>>> least a try at incorporating the concepts in the table specification, so
>>>>>>>> the definitions can be shared.  I was imagining this could come as part
>>>>>>>> of the next revision of the MV specification.
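
To illustrate the algebraic-summary point above: an average cannot be updated from the old average alone, but it can be if the sum and the count are stored separately and merged with the summary of the new rows. The sketch below is a generic illustration, not a proposed spec structure; `AvgSummary` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AvgSummary:
    """Mergeable state for AVG: keep sum and count, not the average itself."""
    total: float = 0.0
    count: int = 0

    @staticmethod
    def of(values: Iterable[float]) -> "AvgSummary":
        values = list(values)
        return AvgSummary(sum(values), len(values))

    def merge(self, other: "AvgSummary") -> "AvgSummary":
        # Merging is associative and commutative, so deltas can be applied
        # in any order during an incremental refresh.
        return AvgSummary(self.total + other.total, self.count + other.count)

    def average(self) -> float:
        if self.count == 0:
            raise ValueError("average of an empty summary is undefined")
        return self.total / self.count


stored = AvgSummary.of([10.0, 20.0, 30.0])  # state kept with the MV
delta = AvgSummary.of([40.0])               # summary of newly arrived rows
refreshed = stored.merge(delta)             # no rescan of the old rows needed
```

Here `refreshed.average()` is 25.0, the same value a full recomputation over all four rows would give.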
>>>>>>>>
>>>>>>>>> The MV internal structure could evolve in a way that works more
>>>>>>>>> efficiently with the reduced scope of functionalities, without relying
>>>>>>>>> on table to offer the same capabilities. I can at least say that is true
>>>>>>>>> based on my internal knowledge of how Redshift MVs work.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure I fully understand this point, but it seems the main
>>>>>>>> question here is what would break if it started to evolve in this
>>>>>>>> direction.  Is it purely additive or do we suspect some elements would
>>>>>>>> need to be removed?  My gut feeling is that the main concern here is
>>>>>>>> getting the cardinalities correct (i.e. 1 MV should probably have 0, 1 or
>>>>>>>> more materialized storage tables associated with it, to support the more
>>>>>>>> advanced algebraic structures listed above, and perhaps a second without
>>>>>>>> them, and additional metadata to distinguish between these two different
>>>>>>>> modes).
>>>>>>>>
>>>>>>>>> If after the evaluation, we are confident that the MV = view +
>>>>>>>>> storage table approach is the right way to go, then we can debate the
>>>>>>>>> other issues, and I think the next issue to reach consensus should be
>>>>>>>>> "Should the storage table be registered in the catalog?".
>>>>>>>>
>>>>>>>>
>>>>>>>> I actually think there are more fundamental questions posed:
>>>>>>>> 1.  Should we be considering how items should be modelled in the REST
>>>>>>>> API concurrently with the Iceberg spec, as that potentially impacts
>>>>>>>> design decisions (I think the answer is yes, and we should update the doc
>>>>>>>> with sketches of new endpoints and operations on the endpoints to ensure
>>>>>>>> things align).
>>>>>>>> 2.  Going forward, should new aspects of Iceberg artifacts rely on
>>>>>>>> the fact that a catalog is present and that we can rely on a naming
>>>>>>>> convention for looking up other artifacts in a catalog as pointers (I
>>>>>>>> lean yes on this, but I'm a little bit more ambivalent).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>> On Mon, Feb 19, 2024 at 12:52 PM Jack Ye <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I suggest we need a step-by-step process to reach incremental
>>>>>>>>> consensus, otherwise we are constantly talking about many different
>>>>>>>>> debates at the same time.
>>>>>>>>>
>>>>>>>>> In my mind, the first key point we all need to agree upon to move
>>>>>>>>> this design forward is*: Do we really want to go with the MV =
>>>>>>>>> view + storage table design approach for Iceberg MV?*
>>>>>>>>>
>>>>>>>>> I think we (at least me) started with this assumption, mostly
>>>>>>>>> because this is how Trino implements MV, and how Hive tables store MV
>>>>>>>>> information today. But does it mean we should design it that way in 
>>>>>>>>> Iceberg?
>>>>>>>>>
>>>>>>>>> Now that I look back at how we did the view spec design, we could
>>>>>>>>> also have said that we just add a representation field in the table spec
>>>>>>>>> to store a view, and an Iceberg view is just a table with no data but
>>>>>>>>> with representations defined. But we did not do that. So it now feels
>>>>>>>>> quite inconsistent to say we want to just add a few fields in the table
>>>>>>>>> and view specs and call it an Iceberg MV.
>>>>>>>>>
>>>>>>>>> If we look into most of the other database systems (e.g. Redshift,
>>>>>>>>> BigQuery, Snowflake), they never expose implementation details like the
>>>>>>>>> storage table. Apart from being closed-source systems, I think it is
>>>>>>>>> also for good technical reasons. There are many more things that a table
>>>>>>>>> needs to support that do not really apply to MV. The MV internal
>>>>>>>>> structure could evolve in a way that works more efficiently with the
>>>>>>>>> reduced scope of functionalities, without relying on table to offer the
>>>>>>>>> same capabilities. I can at least say that is true based on my internal
>>>>>>>>> knowledge of how Redshift MVs work.
>>>>>>>>>
>>>>>>>>> I think we should fully evaluate both directions, and commit to
>>>>>>>>> one first before debating more things.
>>>>>>>>>
>>>>>>>>> If we have a new and independent Iceberg MV spec, then an Iceberg
>>>>>>>>> MV is under the hood a single object containing all MV information. It
>>>>>>>>> has its own name, snapshots, view representation, etc. I don't believe
>>>>>>>>> we will be blocked by Trino due to its MV SPIs currently requiring the
>>>>>>>>> existence of a storage table, as it will just be a different
>>>>>>>>> implementation from the existing one in Trino-Iceberg. In this
>>>>>>>>> direction, I don't think we need to have any further debate about
>>>>>>>>> pointers, metadata locations, storage table, etc. because everything
>>>>>>>>> will be new.
>>>>>>>>>
>>>>>>>>> If after the evaluation, we are confident that the MV = view +
>>>>>>>>> storage table approach is the right way to go, then we can debate the
>>>>>>>>> other issues, and I think the next issue to reach consensus should be
>>>>>>>>> "Should the storage table be registered in the catalog?".
>>>>>>>>>
>>>>>>>>> What do we think?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Jack,
>>>>>>>>>>
>>>>>>>>>> I think we should consider either allowing the storage table to
>>>>>>>>>> be fully exposed/addressable via the catalog or allowing access via
>>>>>>>>>> namespacing like with metadata tables.  E.g.
>>>>>>>>>> <catalog>.<database>.<table>.<storage>, which would allow for full
>>>>>>>>>> access to the underlying table.
>>>>>>>>>>
>>>>>>>>>> For other engines to interact with the storage table (e.g. to
>>>>>>>>>> execute the query to materialize the table), it may be necessary that
>>>>>>>>>> the table is fully addressable.  Whether the storage table is returned
>>>>>>>>>> as part of list operations may be something we leave up to the catalog
>>>>>>>>>> implementation.
>>>>>>>>>>
>>>>>>>>>> I don't think the table should reference a physical location
>>>>>>>>>> (only a logical reference) since things will be changing behind the
>>>>>>>>>> view definition and I'm not confident we want to have to update the view
>>>>>>>>>> representation every time the storage table is updated.
>>>>>>>>>>
>>>>>>>>>> I think there's still some exploration as to whether we need to
>>>>>>>>>> model this as separate from view endpoints, but there may be enough
>>>>>>>>>> overlap that it's not necessary to have yet another set of endpoints for
>>>>>>>>>> materialized views (maybe filter params if you need to distinguish?).
>>>>>>>>>>
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, Jack:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for raising this.
>>>>>>>>>>>
>>>>>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>>>>>> independent objects, at least at API level. It is very rare for a
>>>>>>>>>>>> system to support operations like "materializing a logical view" or
>>>>>>>>>>>> "upgrading a logical view to MV", because view and MV are very
>>>>>>>>>>>> different in almost every aspect of user experience. Extending the
>>>>>>>>>>>> existing view or table spec to accommodate MV might give us a MV
>>>>>>>>>>>> implementation similar to the current Trino or Hive views, save us some
>>>>>>>>>>>> effort and a few APIs in REST, but it binds us to a very specific design
>>>>>>>>>>>> of MV, which we might regret in the future.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When I reviewed the doc, I thought we were discussing the spec
>>>>>>>>>>> of materialized view, just like the spec of table metadata, but not
>>>>>>>>>>> the user-facing API. I would definitely agree that we should consider MV
>>>>>>>>>>> as another kind of database object in the user-facing API, even though
>>>>>>>>>>> it's internally modelled as a view + storage table pointer.
>>>>>>>>>>>
>>>>>>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>>>>>> should at least consider directly describing the full metadata of the
>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Do you mean we need to add components like
>>>>>>>>>>> `LoadMaterializedViewResponse`? If so, I would +1 this.
>>>>>>>>>>>
>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with
>>>>>>>>>>>> MVs?*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As I have mentioned above, I think we should consider MV as
>>>>>>>>>>> another database object, so I think we should add a set of APIs
>>>>>>>>>>> specifically designed for MV, such as `loadMV`, `refreshMV`.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> As we are discussing the spec change for materialized view,
>>>>>>>>>>>> there has been a missing aspect that is technically also related, and
>>>>>>>>>>>> might affect the MV spec design: *how do we want to add MV support
>>>>>>>>>>>> to the REST spec?*
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to discuss this in a new thread to collect
>>>>>>>>>>>> people's thoughts. This topic expands to the following 2
>>>>>>>>>>>> sub-questions:
>>>>>>>>>>>>
>>>>>>>>>>>> *Q1: how would the MV spec change affect the REST spec?*
>>>>>>>>>>>> In the current proposal, it looks like we are using a design
>>>>>>>>>>>> where a MV is modeled as an Iceberg view linking to an Iceberg storage
>>>>>>>>>>>> table. At the same time, we do not want to expose this storage table in
>>>>>>>>>>>> the catalog, thus the Iceberg view has a pointer to only a metadata JSON
>>>>>>>>>>>> file of the Iceberg storage table. Each MV refresh updates the pointer
>>>>>>>>>>>> to a new metadata JSON file.
>>>>>>>>>>>>
>>>>>>>>>>>> I feel this does not play very well with the direction that
>>>>>>>>>>>> REST is going. The REST catalog is trying to remove the dependency on
>>>>>>>>>>>> the metadata JSON file. For example, in LoadTableResponse the only
>>>>>>>>>>>> required field is the metadata, and metadata-location is actually
>>>>>>>>>>>> optional.
>>>>>>>>>>>>
>>>>>>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>>>>>> should at least consider directly describing the full metadata of the
>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file.
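
A rough sketch of the contrast described above, assuming only what the thread states: in LoadTableResponse the metadata is required and metadata-location is optional. The second class shows what "describing the full metadata of the storage table" inline might look like; `LoadMaterializedViewResponse` is the name floated later in this thread, and its fields here are purely hypothetical, not part of the REST spec.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LoadTableResponse:
    # Required: the full table metadata, returned inline.
    metadata: Dict[str, Any]
    # Optional: a metadata JSON file location, if one exists at all.
    metadata_location: Optional[str] = None


@dataclass
class LoadMaterializedViewResponse:  # hypothetical shape, for discussion only
    view_metadata: Dict[str, Any]
    # Storage table metadata embedded directly instead of a file pointer.
    storage_table_metadata: Dict[str, Any]


def to_payload(resp: LoadTableResponse) -> Dict[str, Any]:
    """Serialize, omitting metadata-location when the catalog has no file."""
    payload: Dict[str, Any] = {"metadata": resp.metadata}
    if resp.metadata_location is not None:
        payload["metadata-location"] = resp.metadata_location
    return payload
```

With this shape, a catalog that never writes a metadata JSON file can still answer a load request, which is the property a pointer-only MV design would lose.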
>>>>>>>>>>>>
>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with
>>>>>>>>>>>> MVs?*
>>>>>>>>>>>> So far we have been thinking about amending the view spec to
>>>>>>>>>>>> accommodate MV. This likely entails having MVs also handled through
>>>>>>>>>>>> the view APIs in the REST spec.
>>>>>>>>>>>>
>>>>>>>>>>>> We need to agree on that first in the community, because this
>>>>>>>>>>>> has various implications, and I am not really sure at this point if it
>>>>>>>>>>>> is the best way to go.
>>>>>>>>>>>>
>>>>>>>>>>>> If MV interactions are through the view APIs, the view APIs
>>>>>>>>>>>> need to be updated to accommodate MV constructs that are not really
>>>>>>>>>>>> related to logical views. In fact, most actions performed on MVs are
>>>>>>>>>>>> more similar to actions performed on a table than on a view, and involve
>>>>>>>>>>>> configuring data layout and read and write constructs. For example,
>>>>>>>>>>>> users might run something like:
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE MATERIALIZED VIEW mv
>>>>>>>>>>>> PARTITION BY col1
>>>>>>>>>>>> CLUSTER BY col2
>>>>>>>>>>>> AS ( // some sql )
>>>>>>>>>>>>
>>>>>>>>>>>> Then the CreateView API needs to accept a partition spec and sort
>>>>>>>>>>>> order that are completely irrelevant for logical views.
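
To make the concern above concrete, here is a minimal sketch of what a single create request shared by logical views and MVs might have to carry. The class and field names (`CreateViewRequest`, `materialized`, `partition_by`, `cluster_by`) are hypothetical and do not come from the Iceberg REST spec; the point is only that the layout fields are meaningless for logical views and need an extra validation rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CreateViewRequest:  # hypothetical shape, not the actual REST schema
    name: str
    sql: str
    materialized: bool = False
    # Layout options that only make sense when data is actually stored.
    partition_by: Optional[List[str]] = None
    cluster_by: Optional[List[str]] = None

    def validate(self) -> None:
        has_layout = bool(self.partition_by or self.cluster_by)
        if has_layout and not self.materialized:
            raise ValueError(
                "partition_by/cluster_by only apply to materialized views"
            )


# The CREATE MATERIALIZED VIEW statement above would map to something like:
request = CreateViewRequest(
    name="mv",
    sql="SELECT 1",  # placeholder for the view query
    materialized=True,
    partition_by=["col1"],
    cluster_by=["col2"],
)
request.validate()
```

A separate MV endpoint would avoid the conditional rule entirely, at the cost of one more request type to maintain.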
>>>>>>>>>>>>
>>>>>>>>>>>> When reading a MV, we might even want to have a
>>>>>>>>>>>> PlanMaterializedView API similar to the PlanTable API we are adding.
>>>>>>>>>>>>
>>>>>>>>>>>> *My personal take*
>>>>>>>>>>>> It feels like we need to reconsider the question of what is the
>>>>>>>>>>>> best way to model MV in Iceberg. Should it be (1) a view linked to a
>>>>>>>>>>>> storage table, or (2) a table with a view SQL associated with it, or
>>>>>>>>>>>> (3) a completely independent thing? This topic was discussed in the
>>>>>>>>>>>> past in this doc
>>>>>>>>>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
>>>>>>>>>>>> but at that time we did not have much perspective on aspects like the
>>>>>>>>>>>> REST spec, and the view integration was also not fully completed yet.
>>>>>>>>>>>> With the new knowledge, I am currently leaning a bit more towards (3).
>>>>>>>>>>>>
>>>>>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>>>>>> independent objects, at least at API level. It is very rare for a
>>>>>>>>>>>> system to support operations like "materializing a logical view" or
>>>>>>>>>>>> "upgrading a logical view to MV", because view and MV are very
>>>>>>>>>>>> different in almost every aspect of user experience. Extending the
>>>>>>>>>>>> existing view or table spec to accommodate MV might give us a MV
>>>>>>>>>>>> implementation similar to the current Trino or Hive views, save us some
>>>>>>>>>>>> effort and a few APIs in REST, but it binds us to a very specific design
>>>>>>>>>>>> of MV, which we might regret in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> If we make a new MV spec, it can be made up of fields that
>>>>>>>>>>>> already exist in the table and view specs, but it is a whole new spec.
>>>>>>>>>>>> In this way, the spec can evolve independently to accommodate
>>>>>>>>>>>> MV-specific features, and we can also create MV-related REST endpoints
>>>>>>>>>>>> that will evolve independently from table and view REST APIs.
>>>>>>>>>>>>
>>>>>>>>>>>> But on the other hand, it is definitely associated with more
>>>>>>>>>>>> work to maintain a new spec, and potentially big refactoring in the
>>>>>>>>>>>> codebase to make sure operations that work on table or view today can
>>>>>>>>>>>> now support MV as a different object. And it definitely has other
>>>>>>>>>>>> problems that I have overlooked. I would greatly appreciate any thoughts
>>>>>>>>>>>> about this!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>>
