Re: Materialized view integration with REST spec

Walaa Eldin Moustafa Wed, 21 Feb 2024 21:05:58 -0800

Thanks Jack! I feel Question 0 is very broad, essentially capturing the
whole design. Can we start by discussing more granular questions?


On Wed, Feb 21, 2024 at 8:53 PM Jack Ye <[email protected]> wrote:

> Thanks everyone for the help in organizing the thoughts!
>
> I have moved the summary of everyone's comments here also to the doc that
> Jan linked under question 0. We can continue to have more discussions there
> and cast votes!
>
> Best,
> Jack Ye
>
> On Wed, Feb 21, 2024 at 12:14 PM Jan Kaul <[email protected]>
> wrote:
>
>> Thanks Micah, I think the voting chips are great.
>>
>> @Szehon, actually what I had in mind was not to have one thread per
>> question but rather have smaller threads that can be resolved more easily.
>> I have the fear that one thread for the current question would lead to a
>> very long and unmanageable discussion.
>>
>> I've added another row to the table where everyone could provide a
>> summary of their reason for choosing a certain design. This way we could
>> move some of the content from the comment threads to the main document.
>> On 21.02.24 19:58, Micah Kornfield wrote:
>>
>> Of course we also need threads that express our preferences (voting). I
>>> would suggest to keep these separate from discussions about single points
>>> so that they can be persisted in the document.
>>
>>
>> Not sure if it helpful, but I added voting chips Question 0, as maybe an
>> easier way to keep track of votes.  If it is helpful, I can add them in
>> other places that still need a vote (I think one needs a paid Google Docs
>> account to insert them).
>>
>> Thanks,
>> Micah
>>
>> On Wed, Feb 21, 2024 at 10:23 AM Szehon Ho <[email protected]>
>> wrote:
>>
>>> Thanks Jan.  +1 on having just one thread per question for
>>> vote/preference.  Where do you suggest we have it, on the discussion
>>> question itself?  It would be to keep the existing threads and move it
>>> there.
>>>
>>> Also, I think it makes sense with making a slack channel (for quick
>>> question, reply) , and also discuss unresolved questions in the next week's
>>> sync or a separate meeting.
>>>
>>> On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> Thank you Jack for driving the consensus for the MV spec and thank you
>>>> all for the discussion.
>>>>
>>>> I really like the idea about incremental consensus because we often
>>>> loose sight in detailed discussions. As Jack mentioned, the highest
>>>> priority question currently is:
>>>>
>>>> *Should the Iceberg MV be realized as a view + storage table or do we
>>>> define a new metadata format? *To have one place for the discussion, I
>>>> created another Question (
>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi)
>>>> to the Materialized View Spec google document.
>>>>
>>>> To improve the visibility of the arguments I would like to propose a
>>>> new process. It would be great if all relevant information is stored in the
>>>> document itself. Therefore I would suggest to use the comment threads for
>>>> smaller, temporary discussions which can be resolved by adding the points
>>>> to the main document. Please close the threads if the information was added
>>>> to the document. Additionally, I gave you all permissions to edit the
>>>> documents, so you can add missing points yourselves.
>>>>
>>>> Of course we also need threads that express our preferences (voting). I
>>>> would suggest to keep these separate from discussions about single points
>>>> so that they can be persisted in the document.
>>>>
>>>> After a phase of collecting arguments for the different designs I think
>>>> it would make sense to have video call to have a face to face discussion.
>>>>
>>>> What do you think?
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>> On 20.02.24 21:32, Manish Malhotra wrote:
>>>>
>>>> Very excited for MV to be in Iceberg :)
>>>> Keeping in the same doc. would be helpful, to have the trail.
>>>> But also agreed, if there are too many directions/threads, then keep
>>>> closing the old one, if there are no more questions.
>>>> And put down the assumptions for the initial version to move forward.
>>>>
>>>>
>>>> On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa <
>>>> [email protected]> wrote:
>>>>
>>>>> I would vote to keep a log in the doc with open questions, and keep
>>>>> the doc updated with open questions as they arise/get resolved.
>>>>>
>>>>> On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the response from everyone!
>>>>>>
>>>>>> Before proceeding further, I see a few people referring back to the
>>>>>> current design from Jan. I specifically raised this thread based on the
>>>>>> information in the doc and a few latest discussions we had there. Because
>>>>>> there are many threads in the doc, and each thread points further to 
>>>>>> other
>>>>>> discussion threads in the same doc or other doc, it is now quite hard to
>>>>>> follow and continue discussing all different topics there.
>>>>>>
>>>>>> I hope we can make incremental consensus of the questions in the doc
>>>>>> through devlist, because it provides more visibility, and also a single
>>>>>> thread instead of multiple threads going on at the same time. If we think
>>>>>> this format is not effective, I propose that we create a new mv channel 
>>>>>> in
>>>>>> Iceberg Slack workspace, and people interested can join and discuss all
>>>>>> these points directly. What do we think?
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Great to see more discussion on the MV spec.  Actually, Jan's
>>>>>>> document "Iceberg Materialized View Spec"
>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A>
>>>>>>>  has
>>>>>>> been organized , with a "Design Questions" section to track these 
>>>>>>> debates,
>>>>>>> and it would be nice to centralize the debates there, as Micah mentions.
>>>>>>>
>>>>>>> For Dan's question, I think this debate was tracked in "Design Question
>>>>>>> 3: Should the storage table be registered in the catalog?". I think the
>>>>>>> general idea there was to not expose it directly via Catalog as it is 
>>>>>>> then
>>>>>>> exposed to user modification. If the engine wants to access anything 
>>>>>>> about
>>>>>>> the storage table (including audit and storage), it is of course there 
>>>>>>> via
>>>>>>> the storage table pointer. I think Walaa's point is also good, we could
>>>>>>> expose it as we expose metadata tables, but I am still not sure if 
>>>>>>> there is
>>>>>>> still some use-cases of engine access not covered?
>>>>>>>
>>>>>>> It is true that for Jack's initial question (Do we really want to go
>>>>>>> with the MV = view + storage table design approach for Iceberg MV?),
>>>>>>> unfortunately we did not capture it as a "Design Question" in Jan's 
>>>>>>> doc, as
>>>>>>> it was an implicit assumption of 'yes', because it is the choice of 
>>>>>>> Hive,
>>>>>>> Trino, and other engines , as others have pointed out.
>>>>>>>
>>>>>>> Jack's point about potential evolution of MV (like to add
>>>>>>> partitioning) is an interesting one, but definitely hard to grasp.  I 
>>>>>>> think
>>>>>>> it makes sense to add this as a separate Design Question in the doc, and
>>>>>>> add the options.  This will allow us to flesh out this alternative
>>>>>>> option(s).  Maybe Micah's point about modifying existing proposal to
>>>>>>> 'embed' the required table metadata fields in the existing view 
>>>>>>> metadata,
>>>>>>> is one middle ground option.  Or we add a totally new MV object spec for
>>>>>>> MV, separate than existing View spec?
>>>>>>>
>>>>>>> Also , as Jack pointed out, it may make sense to have the REST /
>>>>>>> Catalog API proposal in the doc to educate the above decision.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think it would help if we answer the question of whether an MV is
>>>>>>>> a view + storage table (and degree of exposing this underlying
>>>>>>>> implementation) in the context of the user interfacing with those 
>>>>>>>> concepts:
>>>>>>>>
>>>>>>>> For the end user, interfacing with the engine APIs (e.g., through
>>>>>>>> SQL), materialized view APIs should be almost the same as regular view 
>>>>>>>> APIs
>>>>>>>> (except for operations specific to materialized views like REFRESH 
>>>>>>>> command
>>>>>>>> etc). Typically, the end user interacts with the (materialized) view 
>>>>>>>> object
>>>>>>>> as a view, and the engine performs the abstraction over the storage 
>>>>>>>> table.
>>>>>>>>
>>>>>>>> For the engines interfacing with Iceberg, it sounds the correct
>>>>>>>> abstraction at this layer is indeed view + storage table, and engines 
>>>>>>>> could
>>>>>>>> have access to both objects to optimize queries.
>>>>>>>>
>>>>>>>> So in a sense, the engine will ultimately hide most of the
>>>>>>>> storage detail from the end user (except for advanced users who want to
>>>>>>>> explicitly access the storage table with a modifier like
>>>>>>>> "db.view.storageTable" -- and they can only read it), while Iceberg 
>>>>>>>> will
>>>>>>>> expose the storage details to the engine catalog to use it in scans if
>>>>>>>> needed. So the storage table is hidden or exposed based on the 
>>>>>>>> context/the
>>>>>>>> actual users. From Iceberg point of view (which interacts with the
>>>>>>>> engines), the storage table is exposed. Note that this does not
>>>>>>>> necessarily mean that the storage table is registered in the catalog 
>>>>>>>> with
>>>>>>>> its own independent name (e.g., where we can drop the view but keep the
>>>>>>>> storage table and access it from the catalog). Addressing the storage 
>>>>>>>> table
>>>>>>>> using a virtual namespace like "db.view.storageTable" sounds like a 
>>>>>>>> good
>>>>>>>> middle ground. Anyways, end users should not need to directly access 
>>>>>>>> the
>>>>>>>> storage table in most cases.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Jack,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> In my mind, the first key point we all need to agree upon to move
>>>>>>>>>> this design forward is*: Do we really want to go with the MV =
>>>>>>>>>> view + storage table design approach for Iceberg MV?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think we want this to the extent that we do not want to redefine
>>>>>>>>> the same concept with different representations/naming to the greatest
>>>>>>>>> degree possible.  This is why borrowing the concepts from the view 
>>>>>>>>> (e.g.
>>>>>>>>> multiple ways of expressing the same view logic in different 
>>>>>>>>> dialects) and
>>>>>>>>> aspects of the materialized data (e.g. partitioning, ordering) feels 
>>>>>>>>> most
>>>>>>>>> natural.  IIUC your proposal, I think you are saying maybe two
>>>>>>>>> modifications to the existing proposals in the document:
>>>>>>>>>
>>>>>>>>> 1.  No separate storage table link, instead embed most of the
>>>>>>>>> metadata of the materialized table into the MV document (the exception
>>>>>>>>> seems to be snapshot history)
>>>>>>>>> 2.  For snapshot history, have one unified history specific to the
>>>>>>>>> MV.
>>>>>>>>>
>>>>>>>>> This seems fairly reasonable to me and I think I can solve some
>>>>>>>>> challenges with the existing proposal in an elegant way.  If this is
>>>>>>>>> correct (or maybe if it isn't quite correct) perhaps you can make
>>>>>>>>> suggestions to the document so all of the trade-offs can be discussed 
>>>>>>>>> in
>>>>>>>>> one place?
>>>>>>>>>
>>>>>>>>> I think the one thing the current draft of the materialized view
>>>>>>>>> ignores is how to store algebraic summaries (e.g. separate sum and 
>>>>>>>>> count
>>>>>>>>> for averages, or other sketches), so that new data can be 
>>>>>>>>> incrementally
>>>>>>>>> incorporated.  But representing these structures feels like it 
>>>>>>>>> potentially
>>>>>>>>> has value beyond just MVs (e.g. it can be a natural way to express 
>>>>>>>>> summary
>>>>>>>>> statistics in table metadata), so I think it deserves at least a try 
>>>>>>>>> in
>>>>>>>>> incorporating the concepts in the table specification, so the 
>>>>>>>>> definitions
>>>>>>>>> can be shared.  I was imagining this could come as part of the next
>>>>>>>>> revision of MV specification.
>>>>>>>>>
>>>>>>>>> The MV internal structure could evolve in a way that works more
>>>>>>>>>> efficiently with the reduced scope of functionalities, without 
>>>>>>>>>> relying on
>>>>>>>>>> table to offer the same capabilities. I can at least say that is 
>>>>>>>>>> true based
>>>>>>>>>> on my internal knowledge of how Redshift MVs work.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm not sure I fully understand this point, but it seems the main
>>>>>>>>> question here is what would break if it started to evolve in this
>>>>>>>>> direction.  Is it purely additive or do we suspect some elements 
>>>>>>>>> would need
>>>>>>>>> to be removed?  My gut feeling here is the main concerns here are  
>>>>>>>>> getting
>>>>>>>>> the cardinatities correct (i.e. 1 MV should probably have 0, 1 or more
>>>>>>>>> materialized storage tables associated with it, to support more 
>>>>>>>>> advanced
>>>>>>>>> algebraic structures listed above, and perhaps a second without them, 
>>>>>>>>> and
>>>>>>>>> additional metadata to distinguish between these two different modes).
>>>>>>>>>
>>>>>>>>> If after the evaluation, we are confident that the MV = view +
>>>>>>>>>> storage table approach is the right way to go, then we can debate 
>>>>>>>>>> the other
>>>>>>>>>> issues, and I think the next issue to reach consensus should be 
>>>>>>>>>> "Should the
>>>>>>>>>> storage table be registered in the catalog?".
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I actually think there are actually more fundamental questions
>>>>>>>>> posed:
>>>>>>>>> 1.  Should be considering how items should be modelled in the REST
>>>>>>>>> API concurrently with the Iceberg spec, as that potentially impacts 
>>>>>>>>> design
>>>>>>>>> decision (I think the answer is yes, and we should update the doc with
>>>>>>>>> sketches on new endpoints and operations on the endpoints to ensure 
>>>>>>>>> things
>>>>>>>>> align).
>>>>>>>>> 2.  Going forward should new aspects of Iceberg artifacts rely on
>>>>>>>>> the fact that a catalog is present and we can rely on a naming 
>>>>>>>>> convention
>>>>>>>>> for looking up other artifacts in a catalog as pointers (I lean yes on
>>>>>>>>> this, but I'm a little bit more ambivalent).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> On Mon, Feb 19, 2024 at 12:52 PM Jack Ye <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I suggest we need a step-by-step process to make incremental
>>>>>>>>>> consensus, otherwise we are constantly talking about many different 
>>>>>>>>>> debates
>>>>>>>>>> at the same time.
>>>>>>>>>>
>>>>>>>>>> In my mind, the first key point we all need to agree upon to move
>>>>>>>>>> this design forward is*: Do we really want to go with the MV =
>>>>>>>>>> view + storage table design approach for Iceberg MV?*
>>>>>>>>>>
>>>>>>>>>> I think we (at least me) started with this assumption, mostly
>>>>>>>>>> because this is how Trino implements MV, and how Hive tables store MV
>>>>>>>>>> information today. But does it mean we should design it that way in 
>>>>>>>>>> Iceberg?
>>>>>>>>>>
>>>>>>>>>> Now I look back at how we did the view spec design, we could also
>>>>>>>>>> say that we just add a representation field in the table spec to 
>>>>>>>>>> store
>>>>>>>>>> view, and an Iceberg view is just a table with no data but with
>>>>>>>>>> representations defined. But we did not do that. So it feels now 
>>>>>>>>>> quite
>>>>>>>>>> inconsistent to say we want to just add a few fields in the table 
>>>>>>>>>> and view
>>>>>>>>>> spec to call it an Iceberg MV.
>>>>>>>>>>
>>>>>>>>>> If we look into most of the other database systems (e.g.
>>>>>>>>>> Redshift, BigQuery, Snowflake), they never expose such implementation
>>>>>>>>>> details like storage table. Apart from being close-sourced systems, 
>>>>>>>>>> I think
>>>>>>>>>> it is also for good technical reasons. There are many more things 
>>>>>>>>>> that a
>>>>>>>>>> table needs to support, but does not really apply to MV. The MV 
>>>>>>>>>> internal
>>>>>>>>>> structure could evolve in a way that works more efficiently with the
>>>>>>>>>> reduced scope of functionalities, without relying on table to offer 
>>>>>>>>>> the
>>>>>>>>>> same capabilities. I can at least say that is true based on my 
>>>>>>>>>> internal
>>>>>>>>>> knowledge of how Redshift MVs work.
>>>>>>>>>>
>>>>>>>>>> I think we should fully evaluate both directions, and commit to
>>>>>>>>>> one first before debating more things.
>>>>>>>>>>
>>>>>>>>>> If we have a new and independent Iceberg MV spec, then an Iceberg
>>>>>>>>>> MV is under-the-hood a single object containing all MV information. 
>>>>>>>>>> It has
>>>>>>>>>> its own name, snapshots, view representation, etc. I don't believe 
>>>>>>>>>> we will
>>>>>>>>>> be blocked by Trino due to its MV SPIs currently requiring the 
>>>>>>>>>> existence of
>>>>>>>>>> a storage table, as it will just be a different implementation from 
>>>>>>>>>> the
>>>>>>>>>> existing one in Trino-Iceberg. In this direction, I don't think we 
>>>>>>>>>> need to
>>>>>>>>>> have any further debate about pointers, metadata locations, storage 
>>>>>>>>>> table,
>>>>>>>>>> etc. because everything will be new.
>>>>>>>>>>
>>>>>>>>>> If after the evaluation, we are confident that the MV = view +
>>>>>>>>>> storage table approach is the right way to go, then we can debate 
>>>>>>>>>> the other
>>>>>>>>>> issues, and I think the next issue to reach consensus should be 
>>>>>>>>>> "Should the
>>>>>>>>>> storage table be registered in the catalog?".
>>>>>>>>>>
>>>>>>>>>> What do we think?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Jack,
>>>>>>>>>>>
>>>>>>>>>>> I think we should consider either allowing the storage table to
>>>>>>>>>>> be fully exposed/addressable via the catalog or allow access via
>>>>>>>>>>> namespacing like with metadata tables.  E.g.
>>>>>>>>>>> <catalog>.<database>.<table>.<storage>, which would allow for full 
>>>>>>>>>>> access
>>>>>>>>>>> to the underlying table.
>>>>>>>>>>>
>>>>>>>>>>> For other engines to interact with the storage table (e.g. to
>>>>>>>>>>> execute the query to materialize the table), it may be necessary 
>>>>>>>>>>> that the
>>>>>>>>>>> table is fully addressable.  Whether the storage table is returned 
>>>>>>>>>>> as part
>>>>>>>>>>> of list operations may be something we leave up to the catalog
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> I don't think the table should reference a physical location
>>>>>>>>>>> (only a logical reference) since things will be changing behind the 
>>>>>>>>>>> view
>>>>>>>>>>> definition and I'm not confident we want to have to update the view
>>>>>>>>>>> representation everytime the storage table is updated.
>>>>>>>>>>>
>>>>>>>>>>> I think there's still some exploration as to whether we need to
>>>>>>>>>>> model this as separate from view endpoints, but there may be enough 
>>>>>>>>>>> overlap
>>>>>>>>>>> that it's not necessary to have yet another set of endpoints for
>>>>>>>>>>> materialized views (maybe filter params if you need to 
>>>>>>>>>>> distinguish?).
>>>>>>>>>>>
>>>>>>>>>>> -Dan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, Jack:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for raising this.
>>>>>>>>>>>>
>>>>>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>>>>>>> independent objects, at least at API level. It is very rare for a 
>>>>>>>>>>>>> system to
>>>>>>>>>>>>> support operations like "materializing a logical view" or 
>>>>>>>>>>>>> "upgrading a
>>>>>>>>>>>>> logical view to MV", because view and MV are very different in 
>>>>>>>>>>>>> almost every
>>>>>>>>>>>>> aspect of user experience. Extending the existing view or table 
>>>>>>>>>>>>> spec to
>>>>>>>>>>>>> accommodate MV might give us a MV implementation similar to the 
>>>>>>>>>>>>> current
>>>>>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, 
>>>>>>>>>>>>> but it
>>>>>>>>>>>>> binds us to a very specific design of MV, which we might regret 
>>>>>>>>>>>>> in the
>>>>>>>>>>>>> future.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> When I reviewed the doc, I thought we were discussing the spec
>>>>>>>>>>>> of materialized view, just like the spec of table metadata, but 
>>>>>>>>>>>> didn't not
>>>>>>>>>>>> the user facing api. I would definitely agree that we should 
>>>>>>>>>>>> consider MV as
>>>>>>>>>>>> another kind of database object in user facing api, even though 
>>>>>>>>>>>> it's
>>>>>>>>>>>> internally modelled as a view + storage table pointer.
>>>>>>>>>>>>
>>>>>>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>>>>>>> should at least consider directly describing the full metadata of 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Do you mean we need to add components like
>>>>>>>>>>>> `LoadMaterializedViewResponse`, if so, I would +1 for this.
>>>>>>>>>>>>
>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with
>>>>>>>>>>>>> MVs?*
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> As I have mentioned above,  I think we should consider MV as
>>>>>>>>>>>> another database object, so I think we should add a set of apis
>>>>>>>>>>>> specifically designed for MV, such as `loadMV`, `freshMV`.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As we are discussing the spec change for materialized view,
>>>>>>>>>>>>> there has been a missing aspect that is technically also related, 
>>>>>>>>>>>>> and might
>>>>>>>>>>>>> affect the MV spec design: *how do we want to add MV support
>>>>>>>>>>>>> to the REST spec?*
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to discuss this in a new thread to collect
>>>>>>>>>>>>> people's thoughts. This topic expands to the following 2 
>>>>>>>>>>>>> sub-questions:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Q1: how would the MV spec change affect the REST spec?*
>>>>>>>>>>>>> In the current proposal, it looks like we are using a design
>>>>>>>>>>>>> where a MV is modeled as an Iceberg view linking to an Iceberg 
>>>>>>>>>>>>> storage
>>>>>>>>>>>>> table. At the same time, we do not want to expose this storage 
>>>>>>>>>>>>> table in the
>>>>>>>>>>>>> catalog, thus the Iceberg view has a pointer to only a metadata 
>>>>>>>>>>>>> JSON file
>>>>>>>>>>>>> of the Iceberg storage table. Each MV refresh updates the pointer 
>>>>>>>>>>>>> to a new
>>>>>>>>>>>>> metadata JSON file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I feel this does not play very well with the direction that
>>>>>>>>>>>>> REST is going. The REST catalog is trying to remove the 
>>>>>>>>>>>>> dependency to the
>>>>>>>>>>>>> metadata JSON file. For example, in LoadTableResponse the only 
>>>>>>>>>>>>> required
>>>>>>>>>>>>> field is the metadata, and metadata-location is actually optional.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we want to make the REST experience good for MV, I think we
>>>>>>>>>>>>> should at least consider directly describing the full metadata of 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with
>>>>>>>>>>>>> MVs?*
>>>>>>>>>>>>> So far we have been thinking about amending the view spec to
>>>>>>>>>>>>> accommodate MV. This entails likely having MVs also being handled 
>>>>>>>>>>>>> through
>>>>>>>>>>>>> the view APIs in REST spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We need to agree with that first in the community, because
>>>>>>>>>>>>> this has various implications, and I am not really sure at this 
>>>>>>>>>>>>> point if it
>>>>>>>>>>>>> is the best way to go.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If MV interactions are through the view APIs, the view APIs
>>>>>>>>>>>>> need to be updated to accommodate MV constructs that are not 
>>>>>>>>>>>>> really related
>>>>>>>>>>>>> to logical views. In fact, most actions performed on MVs are more 
>>>>>>>>>>>>> similar
>>>>>>>>>>>>> to actions performed on table rather than view, which involve 
>>>>>>>>>>>>> configuring
>>>>>>>>>>>>> data layout, read and write constructs. For example, users might 
>>>>>>>>>>>>> run
>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> CREATE MATERIALIZED VIEW mv
>>>>>>>>>>>>> PARTITION BY col1
>>>>>>>>>>>>> CLUSTER BY col2
>>>>>>>>>>>>> AS ( // some sql )
>>>>>>>>>>>>>
>>>>>>>>>>>>> then the CreateView API needs to accept partition spec and
>>>>>>>>>>>>> sort order that are completely not relevant for logical views.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When reading a MV, we might even want to have a
>>>>>>>>>>>>> PlanMaterializedView API similar to the PlanTable API we are 
>>>>>>>>>>>>> adding.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *My personal take*
>>>>>>>>>>>>> It feels like we need to reconsider the question of what is
>>>>>>>>>>>>> the best way to model MV in Iceberg. Should it be (1) a view 
>>>>>>>>>>>>> linked to a
>>>>>>>>>>>>> storage table, or (2) a table with a view SQL associated with it, 
>>>>>>>>>>>>> or (3)
>>>>>>>>>>>>> it's a completely independent thing. This topic was discussed in 
>>>>>>>>>>>>> the past in
>>>>>>>>>>>>> this doc
>>>>>>>>>>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>,
>>>>>>>>>>>>> but at that time we did not have much perspective about aspects 
>>>>>>>>>>>>> like REST
>>>>>>>>>>>>> spec, and the view integration was also not fully completed yet. 
>>>>>>>>>>>>> With the
>>>>>>>>>>>>> new knowledge, currently I am actually leaning a bit more towards 
>>>>>>>>>>>>> (3).
>>>>>>>>>>>>>
>>>>>>>>>>>>> In most database systems, MV, view and table are considered
>>>>>>>>>>>>> independent objects, at least at API level. It is very rare for a 
>>>>>>>>>>>>> system to
>>>>>>>>>>>>> support operations like "materializing a logical view" or 
>>>>>>>>>>>>> "upgrading a
>>>>>>>>>>>>> logical view to MV", because view and MV are very different in 
>>>>>>>>>>>>> almost every
>>>>>>>>>>>>> aspect of user experience. Extending the existing view or table 
>>>>>>>>>>>>> spec to
>>>>>>>>>>>>> accommodate MV might give us a MV implementation similar to the 
>>>>>>>>>>>>> current
>>>>>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, 
>>>>>>>>>>>>> but it
>>>>>>>>>>>>> binds us to a very specific design of MV, which we might regret 
>>>>>>>>>>>>> in the
>>>>>>>>>>>>> future.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we make a new MV spec, it can be made up of fields that
>>>>>>>>>>>>> already exist in the table and view specs, but it is a whole new 
>>>>>>>>>>>>> spec. In
>>>>>>>>>>>>> this way, the spec can evolve independently to accommodate MV 
>>>>>>>>>>>>> specific
>>>>>>>>>>>>> features, and we can also create MV-related REST endpoints that 
>>>>>>>>>>>>> will evolve
>>>>>>>>>>>>> independently from table and view REST APIs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But on the other side it is definitely associated with more
>>>>>>>>>>>>> work to maintain a new spec, and potentially big refactoring in 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> codebase to make sure operations today that work on table or view 
>>>>>>>>>>>>> can now
>>>>>>>>>>>>> support MV as a different object. And it definitely has other 
>>>>>>>>>>>>> problems that
>>>>>>>>>>>>> I have overlooked. I would greatly appreciate any thoughts about 
>>>>>>>>>>>>> this!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: Materialized view integration with REST spec

Reply via email to