Thanks Jan. +1 on having just one thread per question for vote/preference. Where do you suggest we have it, on the discussion question itself? It would be to keep the existing threads and move it there.
Also, I think it makes sense with making a slack channel (for quick question, reply) , and also discuss unresolved questions in the next week's sync or a separate meeting. On Wed, Feb 21, 2024 at 12:40 AM Jan Kaul <jank...@mailbox.org.invalid> wrote: > Thank you Jack for driving the consensus for the MV spec and thank you all > for the discussion. > > I really like the idea about incremental consensus because we often loose > sight in detailed discussions. As Jack mentioned, the highest priority > question currently is: > > *Should the Iceberg MV be realized as a view + storage table or do we > define a new metadata format? *To have one place for the discussion, I > created another Question ( > https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.y70rtfhi9qxi) > to the Materialized View Spec google document. > > To improve the visibility of the arguments I would like to propose a new > process. It would be great if all relevant information is stored in the > document itself. Therefore I would suggest to use the comment threads for > smaller, temporary discussions which can be resolved by adding the points > to the main document. Please close the threads if the information was added > to the document. Additionally, I gave you all permissions to edit the > documents, so you can add missing points yourselves. > > Of course we also need threads that express our preferences (voting). I > would suggest to keep these separate from discussions about single points > so that they can be persisted in the document. > > After a phase of collecting arguments for the different designs I think it > would make sense to have video call to have a face to face discussion. > > What do you think? > > Best wishes, > > Jan > On 20.02.24 21:32, Manish Malhotra wrote: > > Very excited for MV to be in Iceberg :) > Keeping in the same doc. would be helpful, to have the trail. > But also agreed, if there are too many directions/threads, then keep > closing the old one, if there are no more questions. > And put down the assumptions for the initial version to move forward. > > > On Tue, Feb 20, 2024 at 12:17 PM Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >> I would vote to keep a log in the doc with open questions, and keep the >> doc updated with open questions as they arise/get resolved. >> >> On Tue, Feb 20, 2024 at 11:37 AM Jack Ye <yezhao...@gmail.com> wrote: >> >>> Thanks for the response from everyone! >>> >>> Before proceeding further, I see a few people referring back to the >>> current design from Jan. I specifically raised this thread based on the >>> information in the doc and a few latest discussions we had there. Because >>> there are many threads in the doc, and each thread points further to other >>> discussion threads in the same doc or other doc, it is now quite hard to >>> follow and continue discussing all different topics there. >>> >>> I hope we can make incremental consensus of the questions in the doc >>> through devlist, because it provides more visibility, and also a single >>> thread instead of multiple threads going on at the same time. If we think >>> this format is not effective, I propose that we create a new mv channel in >>> Iceberg Slack workspace, and people interested can join and discuss all >>> these points directly. What do we think? >>> >>> Best, >>> Jack Ye >>> >>> >>> >>> On Mon, Feb 19, 2024 at 6:03 PM Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> Great to see more discussion on the MV spec. Actually, Jan's document >>>> "Iceberg >>>> Materialized View Spec" >>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A> >>>> has >>>> been organized , with a "Design Questions" section to track these debates, >>>> and it would be nice to centralize the debates there, as Micah mentions. >>>> >>>> For Dan's question, I think this debate was tracked in "Design Question >>>> 3: Should the storage table be registered in the catalog?". I think the >>>> general idea there was to not expose it directly via Catalog as it is then >>>> exposed to user modification. If the engine wants to access anything about >>>> the storage table (including audit and storage), it is of course there via >>>> the storage table pointer. I think Walaa's point is also good, we could >>>> expose it as we expose metadata tables, but I am still not sure if there is >>>> still some use-cases of engine access not covered? >>>> >>>> It is true that for Jack's initial question (Do we really want to go >>>> with the MV = view + storage table design approach for Iceberg MV?), >>>> unfortunately we did not capture it as a "Design Question" in Jan's doc, as >>>> it was an implicit assumption of 'yes', because it is the choice of Hive, >>>> Trino, and other engines , as others have pointed out. >>>> >>>> Jack's point about potential evolution of MV (like to add partitioning) >>>> is an interesting one, but definitely hard to grasp. I think it makes >>>> sense to add this as a separate Design Question in the doc, and add the >>>> options. This will allow us to flesh out this alternative >>>> option(s). Maybe Micah's point about modifying existing proposal to >>>> 'embed' the required table metadata fields in the existing view metadata, >>>> is one middle ground option. Or we add a totally new MV object spec for >>>> MV, separate than existing View spec? >>>> >>>> Also , as Jack pointed out, it may make sense to have the REST / >>>> Catalog API proposal in the doc to educate the above decision. >>>> >>>> Thanks >>>> Szehon >>>> >>>> On Mon, Feb 19, 2024 at 4:08 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> I think it would help if we answer the question of whether an MV is a >>>>> view + storage table (and degree of exposing this underlying >>>>> implementation) in the context of the user interfacing with those >>>>> concepts: >>>>> >>>>> For the end user, interfacing with the engine APIs (e.g., through >>>>> SQL), materialized view APIs should be almost the same as regular view >>>>> APIs >>>>> (except for operations specific to materialized views like REFRESH command >>>>> etc). Typically, the end user interacts with the (materialized) view >>>>> object >>>>> as a view, and the engine performs the abstraction over the storage table. >>>>> >>>>> For the engines interfacing with Iceberg, it sounds the correct >>>>> abstraction at this layer is indeed view + storage table, and engines >>>>> could >>>>> have access to both objects to optimize queries. >>>>> >>>>> So in a sense, the engine will ultimately hide most of the >>>>> storage detail from the end user (except for advanced users who want to >>>>> explicitly access the storage table with a modifier like >>>>> "db.view.storageTable" -- and they can only read it), while Iceberg will >>>>> expose the storage details to the engine catalog to use it in scans if >>>>> needed. So the storage table is hidden or exposed based on the context/the >>>>> actual users. From Iceberg point of view (which interacts with the >>>>> engines), the storage table is exposed. Note that this does not >>>>> necessarily mean that the storage table is registered in the catalog with >>>>> its own independent name (e.g., where we can drop the view but keep the >>>>> storage table and access it from the catalog). Addressing the storage >>>>> table >>>>> using a virtual namespace like "db.view.storageTable" sounds like a good >>>>> middle ground. Anyways, end users should not need to directly access the >>>>> storage table in most cases. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> On Mon, Feb 19, 2024 at 3:38 PM Micah Kornfield <emkornfi...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi Jack, >>>>>> >>>>>> >>>>>>> In my mind, the first key point we all need to agree upon to move >>>>>>> this design forward is*: Do we really want to go with the MV = view >>>>>>> + storage table design approach for Iceberg MV?* >>>>>> >>>>>> >>>>>> I think we want this to the extent that we do not want to redefine >>>>>> the same concept with different representations/naming to the greatest >>>>>> degree possible. This is why borrowing the concepts from the view (e.g. >>>>>> multiple ways of expressing the same view logic in different dialects) >>>>>> and >>>>>> aspects of the materialized data (e.g. partitioning, ordering) feels most >>>>>> natural. IIUC your proposal, I think you are saying maybe two >>>>>> modifications to the existing proposals in the document: >>>>>> >>>>>> 1. No separate storage table link, instead embed most of the >>>>>> metadata of the materialized table into the MV document (the exception >>>>>> seems to be snapshot history) >>>>>> 2. For snapshot history, have one unified history specific to the MV. >>>>>> >>>>>> This seems fairly reasonable to me and I think I can solve some >>>>>> challenges with the existing proposal in an elegant way. If this is >>>>>> correct (or maybe if it isn't quite correct) perhaps you can make >>>>>> suggestions to the document so all of the trade-offs can be discussed in >>>>>> one place? >>>>>> >>>>>> I think the one thing the current draft of the materialized view >>>>>> ignores is how to store algebraic summaries (e.g. separate sum and count >>>>>> for averages, or other sketches), so that new data can be incrementally >>>>>> incorporated. But representing these structures feels like it >>>>>> potentially >>>>>> has value beyond just MVs (e.g. it can be a natural way to express >>>>>> summary >>>>>> statistics in table metadata), so I think it deserves at least a try in >>>>>> incorporating the concepts in the table specification, so the definitions >>>>>> can be shared. I was imagining this could come as part of the next >>>>>> revision of MV specification. >>>>>> >>>>>> The MV internal structure could evolve in a way that works more >>>>>>> efficiently with the reduced scope of functionalities, without relying >>>>>>> on >>>>>>> table to offer the same capabilities. I can at least say that is true >>>>>>> based >>>>>>> on my internal knowledge of how Redshift MVs work. >>>>>> >>>>>> >>>>>> I'm not sure I fully understand this point, but it seems the main >>>>>> question here is what would break if it started to evolve in this >>>>>> direction. Is it purely additive or do we suspect some elements would >>>>>> need >>>>>> to be removed? My gut feeling here is the main concerns here are >>>>>> getting >>>>>> the cardinatities correct (i.e. 1 MV should probably have 0, 1 or more >>>>>> materialized storage tables associated with it, to support more advanced >>>>>> algebraic structures listed above, and perhaps a second without them, and >>>>>> additional metadata to distinguish between these two different modes). >>>>>> >>>>>> If after the evaluation, we are confident that the MV = view + >>>>>>> storage table approach is the right way to go, then we can debate the >>>>>>> other >>>>>>> issues, and I think the next issue to reach consensus should be "Should >>>>>>> the >>>>>>> storage table be registered in the catalog?". >>>>>> >>>>>> >>>>>> I actually think there are actually more fundamental questions posed: >>>>>> 1. Should be considering how items should be modelled in the REST >>>>>> API concurrently with the Iceberg spec, as that potentially impacts >>>>>> design >>>>>> decision (I think the answer is yes, and we should update the doc with >>>>>> sketches on new endpoints and operations on the endpoints to ensure >>>>>> things >>>>>> align). >>>>>> 2. Going forward should new aspects of Iceberg artifacts rely on the >>>>>> fact that a catalog is present and we can rely on a naming convention for >>>>>> looking up other artifacts in a catalog as pointers (I lean yes on this, >>>>>> but I'm a little bit more ambivalent). >>>>>> >>>>>> Thanks, >>>>>> Micah >>>>>> >>>>>> On Mon, Feb 19, 2024 at 12:52 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>> >>>>>>> I suggest we need a step-by-step process to make incremental >>>>>>> consensus, otherwise we are constantly talking about many different >>>>>>> debates >>>>>>> at the same time. >>>>>>> >>>>>>> In my mind, the first key point we all need to agree upon to move >>>>>>> this design forward is*: Do we really want to go with the MV = view >>>>>>> + storage table design approach for Iceberg MV?* >>>>>>> >>>>>>> I think we (at least me) started with this assumption, mostly >>>>>>> because this is how Trino implements MV, and how Hive tables store MV >>>>>>> information today. But does it mean we should design it that way in >>>>>>> Iceberg? >>>>>>> >>>>>>> Now I look back at how we did the view spec design, we could also >>>>>>> say that we just add a representation field in the table spec to store >>>>>>> view, and an Iceberg view is just a table with no data but with >>>>>>> representations defined. But we did not do that. So it feels now quite >>>>>>> inconsistent to say we want to just add a few fields in the table and >>>>>>> view >>>>>>> spec to call it an Iceberg MV. >>>>>>> >>>>>>> If we look into most of the other database systems (e.g. Redshift, >>>>>>> BigQuery, Snowflake), they never expose such implementation details like >>>>>>> storage table. Apart from being close-sourced systems, I think it is >>>>>>> also >>>>>>> for good technical reasons. There are many more things that a table >>>>>>> needs >>>>>>> to support, but does not really apply to MV. The MV internal structure >>>>>>> could evolve in a way that works more efficiently with the reduced >>>>>>> scope of >>>>>>> functionalities, without relying on table to offer the same >>>>>>> capabilities. I >>>>>>> can at least say that is true based on my internal knowledge of how >>>>>>> Redshift MVs work. >>>>>>> >>>>>>> I think we should fully evaluate both directions, and commit to one >>>>>>> first before debating more things. >>>>>>> >>>>>>> If we have a new and independent Iceberg MV spec, then an Iceberg MV >>>>>>> is under-the-hood a single object containing all MV information. It has >>>>>>> its >>>>>>> own name, snapshots, view representation, etc. I don't believe we will >>>>>>> be >>>>>>> blocked by Trino due to its MV SPIs currently requiring the existence >>>>>>> of a >>>>>>> storage table, as it will just be a different implementation from the >>>>>>> existing one in Trino-Iceberg. In this direction, I don't think we need >>>>>>> to >>>>>>> have any further debate about pointers, metadata locations, storage >>>>>>> table, >>>>>>> etc. because everything will be new. >>>>>>> >>>>>>> If after the evaluation, we are confident that the MV = view + >>>>>>> storage table approach is the right way to go, then we can debate the >>>>>>> other >>>>>>> issues, and I think the next issue to reach consensus should be "Should >>>>>>> the >>>>>>> storage table be registered in the catalog?". >>>>>>> >>>>>>> What do we think? >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Feb 19, 2024 at 11:32 AM Daniel Weeks <dwe...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> Jack, >>>>>>>> >>>>>>>> I think we should consider either allowing the storage table to be >>>>>>>> fully exposed/addressable via the catalog or allow access via >>>>>>>> namespacing >>>>>>>> like with metadata tables. E.g. >>>>>>>> <catalog>.<database>.<table>.<storage>, >>>>>>>> which would allow for full access to the underlying table. >>>>>>>> >>>>>>>> For other engines to interact with the storage table (e.g. to >>>>>>>> execute the query to materialize the table), it may be necessary that >>>>>>>> the >>>>>>>> table is fully addressable. Whether the storage table is returned as >>>>>>>> part >>>>>>>> of list operations may be something we leave up to the catalog >>>>>>>> implementation. >>>>>>>> >>>>>>>> I don't think the table should reference a physical location (only >>>>>>>> a logical reference) since things will be changing behind the view >>>>>>>> definition and I'm not confident we want to have to update the view >>>>>>>> representation everytime the storage table is updated. >>>>>>>> >>>>>>>> I think there's still some exploration as to whether we need to >>>>>>>> model this as separate from view endpoints, but there may be enough >>>>>>>> overlap >>>>>>>> that it's not necessary to have yet another set of endpoints for >>>>>>>> materialized views (maybe filter params if you need to distinguish?). >>>>>>>> >>>>>>>> -Dan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sun, Feb 18, 2024 at 6:57 PM Renjie Liu <liurenjie2...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, Jack: >>>>>>>>> >>>>>>>>> Thanks for raising this. >>>>>>>>> >>>>>>>>> In most database systems, MV, view and table are considered >>>>>>>>>> independent objects, at least at API level. It is very rare for a >>>>>>>>>> system to >>>>>>>>>> support operations like "materializing a logical view" or "upgrading >>>>>>>>>> a >>>>>>>>>> logical view to MV", because view and MV are very different in >>>>>>>>>> almost every >>>>>>>>>> aspect of user experience. Extending the existing view or table spec >>>>>>>>>> to >>>>>>>>>> accommodate MV might give us a MV implementation similar to the >>>>>>>>>> current >>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, but >>>>>>>>>> it >>>>>>>>>> binds us to a very specific design of MV, which we might regret in >>>>>>>>>> the >>>>>>>>>> future. >>>>>>>>> >>>>>>>>> >>>>>>>>> When I reviewed the doc, I thought we were discussing the spec of >>>>>>>>> materialized view, just like the spec of table metadata, but didn't >>>>>>>>> not the >>>>>>>>> user facing api. I would definitely agree that we should consider MV >>>>>>>>> as >>>>>>>>> another kind of database object in user facing api, even though it's >>>>>>>>> internally modelled as a view + storage table pointer. >>>>>>>>> >>>>>>>>> If we want to make the REST experience good for MV, I think we >>>>>>>>>> should at least consider directly describing the full metadata of the >>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file. >>>>>>>>> >>>>>>>>> >>>>>>>>> Do you mean we need to add components like >>>>>>>>> `LoadMaterializedViewResponse`, if so, I would +1 for this. >>>>>>>>> >>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with MVs?* >>>>>>>>> >>>>>>>>> >>>>>>>>> As I have mentioned above, I think we should consider MV as >>>>>>>>> another database object, so I think we should add a set of apis >>>>>>>>> specifically designed for MV, such as `loadMV`, `freshMV`. >>>>>>>>> >>>>>>>>> On Sat, Feb 17, 2024 at 11:14 AM Jack Ye <yezhao...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi everyone, >>>>>>>>>> >>>>>>>>>> As we are discussing the spec change for materialized view, there >>>>>>>>>> has been a missing aspect that is technically also related, and might >>>>>>>>>> affect the MV spec design: *how do we want to add MV support to >>>>>>>>>> the REST spec?* >>>>>>>>>> >>>>>>>>>> I would like to discuss this in a new thread to collect people's >>>>>>>>>> thoughts. This topic expands to the following 2 sub-questions: >>>>>>>>>> >>>>>>>>>> *Q1: how would the MV spec change affect the REST spec?* >>>>>>>>>> In the current proposal, it looks like we are using a design >>>>>>>>>> where a MV is modeled as an Iceberg view linking to an Iceberg >>>>>>>>>> storage >>>>>>>>>> table. At the same time, we do not want to expose this storage table >>>>>>>>>> in the >>>>>>>>>> catalog, thus the Iceberg view has a pointer to only a metadata JSON >>>>>>>>>> file >>>>>>>>>> of the Iceberg storage table. Each MV refresh updates the pointer to >>>>>>>>>> a new >>>>>>>>>> metadata JSON file. >>>>>>>>>> >>>>>>>>>> I feel this does not play very well with the direction that REST >>>>>>>>>> is going. The REST catalog is trying to remove the dependency to the >>>>>>>>>> metadata JSON file. For example, in LoadTableResponse the only >>>>>>>>>> required >>>>>>>>>> field is the metadata, and metadata-location is actually optional. >>>>>>>>>> >>>>>>>>>> If we want to make the REST experience good for MV, I think we >>>>>>>>>> should at least consider directly describing the full metadata of the >>>>>>>>>> storage table in Iceberg view, instead of pointing to a JSON file. >>>>>>>>>> >>>>>>>>>> *Q2: what REST APIs do we expect to use for interactions with >>>>>>>>>> MVs?* >>>>>>>>>> So far we have been thinking about amending the view spec to >>>>>>>>>> accommodate MV. This entails likely having MVs also being handled >>>>>>>>>> through >>>>>>>>>> the view APIs in REST spec. >>>>>>>>>> >>>>>>>>>> We need to agree with that first in the community, because this >>>>>>>>>> has various implications, and I am not really sure at this point if >>>>>>>>>> it is >>>>>>>>>> the best way to go. >>>>>>>>>> >>>>>>>>>> If MV interactions are through the view APIs, the view APIs need >>>>>>>>>> to be updated to accommodate MV constructs that are not really >>>>>>>>>> related to >>>>>>>>>> logical views. In fact, most actions performed on MVs are more >>>>>>>>>> similar to >>>>>>>>>> actions performed on table rather than view, which involve >>>>>>>>>> configuring data >>>>>>>>>> layout, read and write constructs. For example, users might run >>>>>>>>>> something >>>>>>>>>> like: >>>>>>>>>> >>>>>>>>>> CREATE MATERIALIZED VIEW mv >>>>>>>>>> PARTITION BY col1 >>>>>>>>>> CLUSTER BY col2 >>>>>>>>>> AS ( // some sql ) >>>>>>>>>> >>>>>>>>>> then the CreateView API needs to accept partition spec and sort >>>>>>>>>> order that are completely not relevant for logical views. >>>>>>>>>> >>>>>>>>>> When reading a MV, we might even want to have a >>>>>>>>>> PlanMaterializedView API similar to the PlanTable API we are adding. >>>>>>>>>> >>>>>>>>>> *My personal take* >>>>>>>>>> It feels like we need to reconsider the question of what is the >>>>>>>>>> best way to model MV in Iceberg. Should it be (1) a view linked to a >>>>>>>>>> storage table, or (2) a table with a view SQL associated with it, or >>>>>>>>>> (3) >>>>>>>>>> it's a completely independent thing. This topic was discussed in the >>>>>>>>>> past in >>>>>>>>>> this doc >>>>>>>>>> <https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/edit?pli=1>, >>>>>>>>>> but at that time we did not have much perspective about aspects like >>>>>>>>>> REST >>>>>>>>>> spec, and the view integration was also not fully completed yet. >>>>>>>>>> With the >>>>>>>>>> new knowledge, currently I am actually leaning a bit more towards >>>>>>>>>> (3). >>>>>>>>>> >>>>>>>>>> In most database systems, MV, view and table are considered >>>>>>>>>> independent objects, at least at API level. It is very rare for a >>>>>>>>>> system to >>>>>>>>>> support operations like "materializing a logical view" or "upgrading >>>>>>>>>> a >>>>>>>>>> logical view to MV", because view and MV are very different in >>>>>>>>>> almost every >>>>>>>>>> aspect of user experience. Extending the existing view or table spec >>>>>>>>>> to >>>>>>>>>> accommodate MV might give us a MV implementation similar to the >>>>>>>>>> current >>>>>>>>>> Trino or Hive views, save us some effort and a few APIs in REST, but >>>>>>>>>> it >>>>>>>>>> binds us to a very specific design of MV, which we might regret in >>>>>>>>>> the >>>>>>>>>> future. >>>>>>>>>> >>>>>>>>>> If we make a new MV spec, it can be made up of fields that >>>>>>>>>> already exist in the table and view specs, but it is a whole new >>>>>>>>>> spec. In >>>>>>>>>> this way, the spec can evolve independently to accommodate MV >>>>>>>>>> specific >>>>>>>>>> features, and we can also create MV-related REST endpoints that will >>>>>>>>>> evolve >>>>>>>>>> independently from table and view REST APIs. >>>>>>>>>> >>>>>>>>>> But on the other side it is definitely associated with more work >>>>>>>>>> to maintain a new spec, and potentially big refactoring in the >>>>>>>>>> codebase to >>>>>>>>>> make sure operations today that work on table or view can now >>>>>>>>>> support MV as >>>>>>>>>> a different object. And it definitely has other problems that I have >>>>>>>>>> overlooked. I would greatly appreciate any thoughts about this! >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Jack Ye >>>>>>>>>> >>>>>>>>>>