Hi Jan I agree with Walaa, I think the new Question should be narrow (View = View + Materialization, or new MV metadata), with 3 options (Materialization can be metadata.json or nested object).
We can mention that with the former, we have another decision whether to register it (and then refer to Question 3 already discussed in the document). Otherwise we will have n^2 options here and its hard to understand. What do you think? Thanks Szehon On Thu, Feb 22, 2024 at 1:52 AM Jan Kaul <jank...@mailbox.org.invalid> wrote: > My motivation for the current table is to answer the question: > > *Do we use a View + a Storage Table or do we define a new MV metadata > format? *To be able to provide meaningful arguments about the View + > Storage Table option, I split it into multiple options. Otherwise arguments > would always need to include an additional condition like: > > The downside of the View + Storage Table design is that two entities have > to be registered in the catalog, if the storage table metadata is not > stored as a JSON file or as an internal field. > > We can come back to the more granular questions once the aforementioned > question is answered. > On 22.02.24 06:04, Walaa Eldin Moustafa wrote: > > Thanks Jack! I feel Question 0 is very broad, essentially capturing the > whole design. Can we start by discussing more granular questions? > > On Wed, Feb 21, 2024 at 8:53 PM Jack Ye <yezhao...@gmail.com> wrote: > >> Thanks everyone for the help in organizing the thoughts! >> >> I have moved the summary of everyone's comments here also to the doc that >> Jan linked under question 0. We can continue to have more discussions there >> and cast votes! >> >> Best, >> Jack Ye >> >> On Wed, Feb 21, 2024 at 12:14 PM Jan Kaul <jank...@mailbox.org.invalid> >> <jank...@mailbox.org.invalid> wrote: >> >>> Thanks Micah, I think the voting chips are great. >>> >>> @Szehon, actually what I had in mind was not to have one thread per >>> question but rather have smaller threads that can be resolved more easily. >>> I have the fear that one thread for the current question would lead to a >>> very long and unmanageable discussion. >>> >>> I've added another row to the table where everyone could provide a >>> summary of their reason for choosing a certain design. This way we could >>> move some of the content from the comment threads to the main document. >>> On 21.02.24 19:58, Micah Kornfield wrote: >>> >>> Of course we also need threads that express our preferences (voting). I >>>> would suggest to keep these separate from discussions about single points >>>> so that they can be persisted in the document. >>> >>> >>> Not sure if it helpful, but I added voting chips Question 0, as maybe an >>> easier way to keep track of votes. If it is helpful, I can add them in >>> other places that still need a vote (I think one needs a paid Google Docs >>> account to insert them). >>> >>> Thanks, >>> Micah >>> >>> On Wed, Feb 21, 2024 at 10:23 AM Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>> >>>> Thanks Jan. +1 on having just one thread per question for >>>> vote/preference. Where do you suggest we have it, on the discussion >>>> question itself? It would be to keep the existing threads and move it >>>> there. >>>> >>>> Also, I think it makes sense with making a slack channel (for quick >>>> question, reply) , and also discuss unresolved questions in the next week's >>>> sync or a separate meeting. >>>> >>>> On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul <jank...@mailbox.org.invalid> >>>> <jank...@mailbox.org.invalid> wrote: >>>> >>>>> Thank you Jack for driving the consensus for the MV spec and thank you >>>>> all for the discussion. >>>>> >>>>> I really like the idea about incremental consensus because we often >>>>> loose sight in detailed discussions. As Jack mentioned, the highest >>>>> priority question currently is: >>>>> >>>>> *Should the Iceberg MV be realized as a view + storage table or do we >>>>> define a new metadata format? *To have one place for the discussion, >>>>> I created another Question ( >>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi) >>>>> to the Materialized View Spec google document. >>>>> >>>>> To improve the visibility of the arguments I would like to propose a >>>>> new process. It would be great if all relevant information is stored in >>>>> the >>>>> document itself. Therefore I would suggest to use the comment threads for >>>>> smaller, temporary discussions which can be resolved by adding the points >>>>> to the main document. Please close the threads if the information was >>>>> added >>>>> to the document. Additionally, I gave you all permissions to edit the >>>>> documents, so you can add missing points yourselves. >>>>> >>>>> Of course we also need threads that express our preferences (voting). >>>>> I would suggest to keep these separate from discussions about single >>>>> points >>>>> so that they can be persisted in the document. >>>>> >>>>> After a phase of collecting arguments for the different designs I >>>>> think it would make sense to have video call to have a face to face >>>>> discussion. >>>>> >>>>> What do you think? >>>>> >>>>> Best wishes, >>>>> >>>>> Jan >>>>> On 20.02.24 21:32, Manish Malhotra wrote: >>>>> >>>>> Very excited for MV to be in Iceberg :) >>>>> Keeping in the same doc. would be helpful, to have the trail. >>>>> But also agreed, if there are too many directions/threads, then keep >>>>> closing the old one, if there are no more questions. >>>>> And put down the assumptions for the initial version to move forward. >>>>> >>>>> >>>>> On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa < >>>>> wa.moust...@gmail.com> wrote: >>>>> >>>>>> I would vote to keep a log in the doc with open questions, and keep >>>>>> the doc updated with open questions as they arise/get resolved. >>>>>> >>>>>> On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <yezhao...@gmail.com> wrote: >>>>>> >>>>>>> Thanks for the response from everyone! >>>>>>> >>>>>>> Before proceeding further, I see a few people referring back to the >>>>>>> current design from Jan. I specifically raised this thread based on the >>>>>>> information in the doc and a few latest discussions we had there. >>>>>>> Because >>>>>>> there are many threads in the doc, and each thread points further to >>>>>>> other >>>>>>> discussion threads in the same doc or other doc, it is now quite hard to >>>>>>> follow and continue discussing all different topics there. >>>>>>> >>>>>>> I hope we can make incremental consensus of the questions in the doc >>>>>>> through devlist, because it provides more visibility, and also a single >>>>>>> thread instead of multiple threads going on at the same time. If we >>>>>>> think >>>>>>> this format is not effective, I propose that we create a new mv channel >>>>>>> in >>>>>>> Iceberg Slack workspace, and people interested can join and discuss all >>>>>>> these points directly. What do we think? >>>>>>> >>>>>>> Best, >>>>>>> Jack Ye >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Great to see more discussion on the MV spec. Actually, Jan's >>>>>>>> document "Iceberg Materialized View Spec" >>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A> >>>>>>>> has >>>>>>>> been organized , with a "Design Questions" section to track these >>>>>>>> debates, >>>>>>>> and it would be nice to centralize the debates there, as Micah >>>>>>>> mentions. >>>>>>>> >>>>>>>> For Dan's question, I think this debate was tracked in "Design Question >>>>>>>> 3: Should the storage table be registered in the catalog?". I think the >>>>>>>> general idea there was to not expose it directly via Catalog as it is >>>>>>>> then >>>>>>>> exposed to user modification. If the engine wants to access anything >>>>>>>> about >>>>>>>> the storage table (including audit and storage), it is of course there >>>>>>>> via >>>>>>>> the storage table pointer. I think Walaa's point is also good, we could >>>>>>>> expose it as we expose metadata tables, but I am still not sure if >>>>>>>> there is >>>>>>>> still some use-cases of engine access not covered? >>>>>>>> >>>>>>>> It is true that for Jack's initial question (Do we really want to >>>>>>>> go with the MV = view + storage table design approach for Iceberg MV?), >>>>>>>> unfortunately we did not capture it as a "Design Question" in Jan's >>>>>>>> doc, as >>>>>>>> it was an implicit assumption of 'yes', because it is the choice of >>>>>>>> Hive, >>>>>>>> Trino, and other engines , as others have pointed out. >>>>>>>> >>>>>>>> Jack's point about potential evolution of MV (like to add >>>>>>>> partitioning) is an interesting one, but definitely hard to grasp. I >>>>>>>> think >>>>>>>> it makes sense to add this as a separate Design Question in the doc, >>>>>>>> and >>>>>>>> add the options. This will allow us to flesh out this alternative >>>>>>>> option(s). Maybe Micah's point about modifying existing proposal to >>>>>>>> 'embed' the required table metadata fields in the existing view >>>>>>>> metadata, >>>>>>>> is one middle ground option. Or we add a totally new MV object spec >>>>>>>> for >>>>>>>> MV, separate than existing View spec? >>>>>>>> >>>>>>>> Also , as Jack pointed out, it may make sense to have the REST / >>>>>>>> Catalog API proposal in the doc to educate the above decision. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Szehon >>>>>>>> >>>>>>>> On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa < >>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I think it would help if we answer the question of whether an MV >>>>>>>>> is a view + storage table (and degree of exposing this underlying >>>>>>>>> implementation) in the context of the user interfacing with those >>>>>>>>> concepts: >>>>>>>>> >>>>>>>>> For the end user, interfacing with the engine APIs (e.g., through >>>>>>>>> SQL), materialized view APIs should be almost the same as regular >>>>>>>>> view APIs >>>>>>>>> (except for operations specific to materialized views like REFRESH >>>>>>>>> command >>>>>>>>> etc). Typically, the end user interacts with the (materialized) view >>>>>>>>> object >>>>>>>>> as a view, and the engine performs the abstraction over the storage >>>>>>>>> table. >>>>>>>>> >>>>>>>>> For the engines interfacing with Iceberg, it sounds the correct >>>>>>>>> abstraction at this layer is indeed view + storage table, and engines >>>>>>>>> could >>>>>>>>> have access to both objects to optimize queries. >>>>>>>>> >>>>>>>>> So in a sense, the engine will ultimately hide most of the >>>>>>>>> storage detail from the end user (except for advanced users who want >>>>>>>>> to >>>>>>>>> explicitly access the storage table with a modifier like >>>>>>>>> "db.view.storageTable" -- and they can only read it), while Iceberg >>>>>>>>> will >>>>>>>>> expose the storage details to the engine catalog to use it in scans if >>>>>>>>> needed. So the storage table is hidden or exposed based on the >>>>>>>>> context/the >>>>>>>>> actual users. From Iceberg point of view (which interacts with the >>>>>>>>> engines), the storage table is exposed. Note that this does not >>>>>>>>> necessarily mean that the storage table is registered in the catalog >>>>>>>>> with >>>>>>>>> its own independent name (e.g., where we can drop the view but keep >>>>>>>>> the >>>>>>>>> storage table and access it from the catalog). Addressing the storage >>>>>>>>> table >>>>>>>>> using a virtual namespace like "db.view.storageTable" sounds like a >>>>>>>>> good >>>>>>>>> middle ground. Anyways, end users should not need to directly access >>>>>>>>> the >>>>>>>>> storage table in most cases. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Walaa. >>>>>>>>> >>>>>>>>> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield < >>>>>>>>> emkornfi...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Jack, >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> In my mind, the first key point we all need to agree upon to >>>>>>>>>>> move this design forward is*: Do we really want to go with the >>>>>>>>>>> MV = view + storage table design approach for Iceberg MV?* >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I think we want this to the extent that we do not want to >>>>>>>>>> redefine the same concept with different representations/naming to >>>>>>>>>> the >>>>>>>>>> greatest degree possible. This is why borrowing the concepts from >>>>>>>>>> the view >>>>>>>>>> (e.g. multiple ways of expressing the same view logic in different >>>>>>>>>> dialects) and aspects of the materialized data (e.g. partitioning, >>>>>>>>>> ordering) feels most natural. IIUC your proposal, I think you are >>>>>>>>>> saying >>>>>>>>>> maybe two modifications to the existing proposals in the document: >>>>>>>>>> >>>>>>>>>> 1. No separate storage table link, instead embed most of the >>>>>>>>>> metadata of the materialized table into the MV document (the >>>>>>>>>> exception >>>>>>>>>> seems to be snapshot history) >>>>>>>>>> 2. For snapshot history, have one unified history specific to >>>>>>>>>> the MV. >>>>>>>>>> >>>>>>>>>> This seems fairly reasonable to me and I think I can solve some >>>>>>>>>> challenges with the existing proposal in an elegant way. If this is >>>>>>>>>> correct (or maybe if it isn't quite correct) perhaps you can make >>>>>>>>>> suggestions to the document so all of the trade-offs can be >>>>>>>>>> discussed in >>>>>>>>>> one place? >>>>>>>>>> >>>>>>>>>> I think the one thing the current draft of the materialized view >>>>>>>>>> ignores is how to store algebraic summaries (e.g. separate sum and >>>>>>>>>> count >>>>>>>>>> for averages, or other sketches), so that new data can be >>>>>>>>>> incrementally >>>>>>>>>> incorporated. But representing these structures feels like it >>>>>>>>>> potentially >>>>>>>>>> has value beyond just MVs (e.g. it can be a natural way to express >>>>>>>>>> summary >>>>>>>>>> statistics in table metadata), so I think it deserves at least a try >>>>>>>>>> in >>>>>>>>>> incorporating the concepts in the table specification, so the >>>>>>>>>> definitions >>>>>>>>>> can be shared. I was imagining this could come as part of the next >>>>>>>>>> revision of MV specification. >>>>>>>>>> >>>>>>>>>> The MV internal structure could evolve in a way that works more >>>>>>>>>>> efficiently with the reduced scope of functionalities, without >>>>>>>>>>> relying on >>>>>>>>>>> table to offer the same capabilities. I can at least say that is >>>>>>>>>>> true based >>>>>>>>>>> on my internal knowledge of how Redshift MVs work. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I'm not sure I fully understand this point, but it seems the main >>>>>>>>>> question here is what would break if it started to evolve in this >>>>>>>>>> direction. Is it purely additive or do we suspect some elements >>>>>>>>>> would need >>>>>>>>>> to be removed? My gut feeling here is the main concerns here are >>>>>>>>>> getting >>>>>>>>>> the cardinatities correct (i.e. 1 MV should probably have 0, 1 or >>>>>>>>>> more >>>>>>>>>> materialized storage tables associated with it, to support more >>>>>>>>>> advanced >>>>>>>>>> algebraic structures listed above, and perhaps a second without >>>>>>>>>> them, and >>>>>>>>>> additional metadata to distinguish between these two different >>>>>>>>>> modes). >>>>>>>>>> >>>>>>>>>> If after the evaluation, we are confident that the MV = view + >>>>>>>>>>> storage table approach is the right way to go, then we can debate >>>>>>>>>>> the other >>>>>>>>>>> issues, and I think the next issue to reach consensus should be >>>>>>>>>>> "Should the >>>>>>>>>>> storage table be registered in the catalog?". >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I actually think there are actually more fundamental questions >>>>>>>>>> posed: >>>>>>>>>> 1. Should be considering how items should be modelled in the >>>>>>>>>> REST API concurrently with the Iceberg spec, as that potentially >>>>>>>>>> impacts >>>>>>>>>> design decision (I think the answer is yes, and we should update the >>>>>>>>>> doc >>>>>>>>>> with sketches on new endpoints and operations on the endpoints to >>>>>>>>>> ensure >>>>>>>>>> things align). >>>>>>>>>> 2. Going forward should new aspects of Iceberg artifacts rely on >>>>>>>>>> the fact that a catalog is present and we can rely on a naming >>>>>>>>>> convention >>>>>>>>>> for looking up other artifacts in a catalog as pointers (I lean yes >>>>>>>>>> on >>>>>>>>>> this, but I'm a little bit more ambivalent). >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Micah >>>>>>>>>> >>>>>>>>>> On Mon, Feb 19, 2024 at 12:52 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I suggest we need a step-by-step process to make incremental >>>>>>>>>>> consensus, otherwise we are constantly talking about many different >>>>>>>>>>> debates >>>>>>>>>>> at the same time. >>>>>>>>>>> >>>>>>>>>>> In my mind, the first key point we all need to agree upon to >>>>>>>>>>> move this design forward is*: Do we really want to go with the >>>>>>>>>>> MV = view + storage table design approach for Iceberg MV?* >>>>>>>>>>> >>>>>>>>>>> I think we (at least me) started with this assumption, mostly >>>>>>>>>>> because this is how Trino implements MV, and how Hive tables store >>>>>>>>>>> MV >>>>>>>>>>> information today. But does it mean we should design it that way in >>>>>>>>>>> Iceberg? >>>>>>>>>>> >>>>>>>>>>> Now I look back at how we did the view spec design, we could >>>>>>>>>>> also say that we just add a representation field in the table spec >>>>>>>>>>> to store >>>>>>>>>>> view, and an Iceberg view is just a table with no data but with >>>>>>>>>>> representations defined. But we did not do that. So it feels now >>>>>>>>>>> quite >>>>>>>>>>> inconsistent to say we want to just add a few fields in the table >>>>>>>>>>> and view >>>>>>>>>>> spec to call it an Iceberg MV. >>>>>>>>>>> >>>>>>>>>>> If we look into most of the other database systems (e.g. >>>>>>>>>>> Redshift, BigQuery, Snowflake), they never expose such >>>>>>>>>>> implementation >>>>>>>>>>> details like storage table. Apart from being close-sourced systems, >>>>>>>>>>> I think >>>>>>>>>>> it is also for good technical reasons. There are many more things >>>>>>>>>>> that a >>>>>>>>>>> table needs to support, but does not really apply to MV. The MV >>>>>>>>>>> internal >>>>>>>>>>> structure could evolve in a way that works more efficiently with the >>>>>>>>>>> reduced scope of functionalities, without relying on table to offer >>>>>>>>>>> the >>>>>>>>>>> same capabilities. I can at least say that is true based on my >>>>>>>>>>> internal >>>>>>>>>>> knowledge of how Redshift MVs work. >>>>>>>>>>> >>>>>>>>>>> I think we should fully evaluate both directions, and commit to >>>>>>>>>>> one first before debating more things. >>>>>>>>>>> >>>>>>>>>>> If we have a new and independent Iceberg MV spec, then an >>>>>>>>>>> Iceberg MV is under-the-hood a single object containing all MV >>>>>>>>>>> information. >>>>>>>>>>> It has its own name, snapshots, view representation, etc. I don't >>>>>>>>>>> believe >>>>>>>>>>> we will be blocked by Trino due to its MV SPIs currently requiring >>>>>>>>>>> the >>>>>>>>>>> existence of a storage table, as it will just be a different >>>>>>>>>>> implementation >>>>>>>>>>> from the existing one in Trino-Iceberg. In this direction, I don't >>>>>>>>>>> think we >>>>>>>>>>> need to have any further debate about pointers, metadata locations, >>>>>>>>>>> storage >>>>>>>>>>> table, etc. because everything will be new. >>>>>>>>>>> >>>>>>>>>>> If after the evaluation, we are confident that the MV = view + >>>>>>>>>>> storage table approach is the right way to go, then we can debate >>>>>>>>>>> the other >>>>>>>>>>> issues, and I think the next issue to reach consensus should be >>>>>>>>>>> "Should the >>>>>>>>>>> storage table be registered in the catalog?". >>>>>>>>>>> >>>>>>>>>>> What do we think? >>>>>>>>>>> >>>>>>>>>>> -Jack >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <dwe...@apache.org> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Jack, >>>>>>>>>>>> >>>>>>>>>>>> I think we should consider either allowing the storage table to >>>>>>>>>>>> be fully exposed/addressable via the catalog or allow access via >>>>>>>>>>>> namespacing like with metadata tables. E.g. >>>>>>>>>>>> <catalog>.<database>.<table>.<storage>, which would allow for full >>>>>>>>>>>> access >>>>>>>>>>>> to the underlying table. >>>>>>>>>>>> >>>>>>>>>>>> For other engines to interact with the storage table (e.g. to >>>>>>>>>>>> execute the query to materialize the table), it may be necessary >>>>>>>>>>>> that the >>>>>>>>>>>> table is fully addressable. Whether the storage table is returned >>>>>>>>>>>> as part >>>>>>>>>>>> of list operations may be something we leave up to the catalog >>>>>>>>>>>> implementation. >>>>>>>>>>>> >>>>>>>>>>>> I don't think the table should reference a physical location >>>>>>>>>>>> (only a logical reference) since things will be changing behind >>>>>>>>>>>> the view >>>>>>>>>>>> definition and I'm not confident we want to have to update the view >>>>>>>>>>>> representation everytime the storage table is updated. >>>>>>>>>>>> >>>>>>>>>>>> I think there's still some exploration as to whether we need to >>>>>>>>>>>> model this as separate from view endpoints, but there may be >>>>>>>>>>>> enough overlap >>>>>>>>>>>> that it's not necessary to have yet another set of endpoints for >>>>>>>>>>>> materialized views (maybe filter params if you need to >>>>>>>>>>>> distinguish?). >>>>>>>>>>>> >>>>>>>>>>>> -Dan >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu < >>>>>>>>>>>> liurenjie2...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, Jack: >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for raising this. >>>>>>>>>>>>> >>>>>>>>>>>>> In most database systems, MV, view and table are considered >>>>>>>>>>>>>> independent objects, at least at API level. It is very rare for >>>>>>>>>>>>>> a system to >>>>>>>>>>>>>> support operations like "materializing a logical view" or >>>>>>>>>>>>>> "upgrading a >>>>>>>>>>>>>> logical view to MV", because view and MV are very different in >>>>>>>>>>>>>> almost every >>>>>>>>>>>>>> aspect of user experience. Extending the existing view or table >>>>>>>>>>>>>> spec to >>>>>>>>>>>>>> accommodate MV might give us a MV implementation similar to the >>>>>>>>>>>>>> current >>>>>>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, >>>>>>>>>>>>>> but it >>>>>>>>>>>>>> binds us to a very specific design of MV, which we might regret >>>>>>>>>>>>>> in the >>>>>>>>>>>>>> future. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> When I reviewed the doc, I thought we were discussing the spec >>>>>>>>>>>>> of materialized view, just like the spec of table metadata, but >>>>>>>>>>>>> didn't not >>>>>>>>>>>>> the user facing api. I would definitely agree that we should >>>>>>>>>>>>> consider MV as >>>>>>>>>>>>> another kind of database object in user facing api, even though >>>>>>>>>>>>> it's >>>>>>>>>>>>> internally modelled as a view + storage table pointer. >>>>>>>>>>>>> >>>>>>>>>>>>> If we want to make the REST experience good for MV, I think we >>>>>>>>>>>>>> should at least consider directly describing the full metadata >>>>>>>>>>>>>> of the >>>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON >>>>>>>>>>>>>> file. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Do you mean we need to add components like >>>>>>>>>>>>> `LoadMaterializedViewResponse`, if so, I would +1 for this. >>>>>>>>>>>>> >>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with >>>>>>>>>>>>>> MVs?* >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> As I have mentioned above, I think we should consider MV as >>>>>>>>>>>>> another database object, so I think we should add a set of apis >>>>>>>>>>>>> specifically designed for MV, such as `loadMV`, `freshMV`. >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>> >>>>>>>>>>>>>> As we are discussing the spec change for materialized view, >>>>>>>>>>>>>> there has been a missing aspect that is technically also >>>>>>>>>>>>>> related, and might >>>>>>>>>>>>>> affect the MV spec design: *how do we want to add MV support >>>>>>>>>>>>>> to the REST spec?* >>>>>>>>>>>>>> >>>>>>>>>>>>>> I would like to discuss this in a new thread to collect >>>>>>>>>>>>>> people's thoughts. This topic expands to the following 2 >>>>>>>>>>>>>> sub-questions: >>>>>>>>>>>>>> >>>>>>>>>>>>>> *Q1: how would the MV spec change affect the REST spec?* >>>>>>>>>>>>>> In the current proposal, it looks like we are using a design >>>>>>>>>>>>>> where a MV is modeled as an Iceberg view linking to an Iceberg >>>>>>>>>>>>>> storage >>>>>>>>>>>>>> table. At the same time, we do not want to expose this storage >>>>>>>>>>>>>> table in the >>>>>>>>>>>>>> catalog, thus the Iceberg view has a pointer to only a metadata >>>>>>>>>>>>>> JSON file >>>>>>>>>>>>>> of the Iceberg storage table. Each MV refresh updates the >>>>>>>>>>>>>> pointer to a new >>>>>>>>>>>>>> metadata JSON file. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I feel this does not play very well with the direction that >>>>>>>>>>>>>> REST is going. The REST catalog is trying to remove the >>>>>>>>>>>>>> dependency to the >>>>>>>>>>>>>> metadata JSON file. For example, in LoadTableResponse the only >>>>>>>>>>>>>> required >>>>>>>>>>>>>> field is the metadata, and metadata-location is actually >>>>>>>>>>>>>> optional. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If we want to make the REST experience good for MV, I think >>>>>>>>>>>>>> we should at least consider directly describing the full >>>>>>>>>>>>>> metadata of the >>>>>>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON >>>>>>>>>>>>>> file. >>>>>>>>>>>>>> >>>>>>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with >>>>>>>>>>>>>> MVs?* >>>>>>>>>>>>>> So far we have been thinking about amending the view spec to >>>>>>>>>>>>>> accommodate MV. This entails likely having MVs also being >>>>>>>>>>>>>> handled through >>>>>>>>>>>>>> the view APIs in REST spec. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We need to agree with that first in the community, because >>>>>>>>>>>>>> this has various implications, and I am not really sure at this >>>>>>>>>>>>>> point if it >>>>>>>>>>>>>> is the best way to go. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If MV interactions are through the view APIs, the view APIs >>>>>>>>>>>>>> need to be updated to accommodate MV constructs that are not >>>>>>>>>>>>>> really related >>>>>>>>>>>>>> to logical views. In fact, most actions performed on MVs are >>>>>>>>>>>>>> more similar >>>>>>>>>>>>>> to actions performed on table rather than view, which involve >>>>>>>>>>>>>> configuring >>>>>>>>>>>>>> data layout, read and write constructs. For example, users might >>>>>>>>>>>>>> run >>>>>>>>>>>>>> something like: >>>>>>>>>>>>>> >>>>>>>>>>>>>> CREATE MATERIALIZED VIEW mv >>>>>>>>>>>>>> PARTITION BY col1 >>>>>>>>>>>>>> CLUSTER BY col2 >>>>>>>>>>>>>> AS ( // some sql ) >>>>>>>>>>>>>> >>>>>>>>>>>>>> then the CreateView API needs to accept partition spec and >>>>>>>>>>>>>> sort order that are completely not relevant for logical views. >>>>>>>>>>>>>> >>>>>>>>>>>>>> When reading a MV, we might even want to have a >>>>>>>>>>>>>> PlanMaterializedView API similar to the PlanTable API we are >>>>>>>>>>>>>> adding. >>>>>>>>>>>>>> >>>>>>>>>>>>>> *My personal take* >>>>>>>>>>>>>> It feels like we need to reconsider the question of what is >>>>>>>>>>>>>> the best way to model MV in Iceberg. Should it be (1) a view >>>>>>>>>>>>>> linked to a >>>>>>>>>>>>>> storage table, or (2) a table with a view SQL associated with >>>>>>>>>>>>>> it, or (3) >>>>>>>>>>>>>> it's a completely independent thing. This topic was discussed in >>>>>>>>>>>>>> the past in >>>>>>>>>>>>>> this doc >>>>>>>>>>>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>, >>>>>>>>>>>>>> but at that time we did not have much perspective about aspects >>>>>>>>>>>>>> like REST >>>>>>>>>>>>>> spec, and the view integration was also not fully completed yet. >>>>>>>>>>>>>> With the >>>>>>>>>>>>>> new knowledge, currently I am actually leaning a bit more >>>>>>>>>>>>>> towards (3). >>>>>>>>>>>>>> >>>>>>>>>>>>>> In most database systems, MV, view and table are considered >>>>>>>>>>>>>> independent objects, at least at API level. It is very rare for >>>>>>>>>>>>>> a system to >>>>>>>>>>>>>> support operations like "materializing a logical view" or >>>>>>>>>>>>>> "upgrading a >>>>>>>>>>>>>> logical view to MV", because view and MV are very different in >>>>>>>>>>>>>> almost every >>>>>>>>>>>>>> aspect of user experience. Extending the existing view or table >>>>>>>>>>>>>> spec to >>>>>>>>>>>>>> accommodate MV might give us a MV implementation similar to the >>>>>>>>>>>>>> current >>>>>>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, >>>>>>>>>>>>>> but it >>>>>>>>>>>>>> binds us to a very specific design of MV, which we might regret >>>>>>>>>>>>>> in the >>>>>>>>>>>>>> future. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If we make a new MV spec, it can be made up of fields that >>>>>>>>>>>>>> already exist in the table and view specs, but it is a whole new >>>>>>>>>>>>>> spec. In >>>>>>>>>>>>>> this way, the spec can evolve independently to accommodate MV >>>>>>>>>>>>>> specific >>>>>>>>>>>>>> features, and we can also create MV-related REST endpoints that >>>>>>>>>>>>>> will evolve >>>>>>>>>>>>>> independently from table and view REST APIs. >>>>>>>>>>>>>> >>>>>>>>>>>>>> But on the other side it is definitely associated with more >>>>>>>>>>>>>> work to maintain a new spec, and potentially big refactoring in >>>>>>>>>>>>>> the >>>>>>>>>>>>>> codebase to make sure operations today that work on table or >>>>>>>>>>>>>> view can now >>>>>>>>>>>>>> support MV as a different object. And it definitely has other >>>>>>>>>>>>>> problems that >>>>>>>>>>>>>> I have overlooked. I would greatly appreciate any thoughts about >>>>>>>>>>>>>> this! >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>> >>>>>>>>>>>>>>