It sounds good. I would also propose to use the "proposal process": creating a GitHub issue with the "proposal" tag and linking the document there in a comment.
Regards
JB

On Tue, Mar 26, 2024 at 3:05 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>
> Thanks Jan! To avoid spreading discussions on multiple places, I will
> continue the comments on the doc. Also it is easier to run into communication
> gaps in email threads since effectively we have one thread, but in docs we
> have many.
>
> Thanks,
> Walaa.
>
> On Tue, Mar 26, 2024 at 6:27 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>
>> I've added a description to the "Combined metadata" Option of Walaa's
>> document. I'm also adding it here:
>>
>> This option treats the underlying view and storage table as a combined
>> catalog object. The operation of this combined approach can be best
>> demonstrated by looking at the different layers of the Iceberg
>> implementation. In the top layer is the Iceberg library that interacts with
>> a particular Iceberg catalog. The catalog handles the access to the metadata
>> storage.
>> This option uses a combined storage object to store view and table metadata
>> related to the materialized view. To avoid the definition of an entirely new
>> metadata format, the storage object is composed of the view and table
>> metadata. Additionally the combined storage object has a single identifier
>> in the catalogs. The Iceberg library treats the materialized view as a
>> separate view and a storage table object, it is only at the catalog and
>> storage layer that the materialized view is treated as a single entity.
>> To reuse most of the existing TableCatalog, ViewCatalog and their
>> operations, the table and view catalog can be thought of as “filters”
>> (lenses), that allow the interaction only with the corresponding part of the
>> MV storage object. Performing a “CommitView” operation on the view catalog
>> will only affect the view metadata part of the combined MV storage object.
>> And similarly, performing a “CommitTable” operation on the table catalog
>> will only affect the table metadata part of the combined MV storage object.
>> Both catalogs use the same identifier for operations on the materialized
>> view.
>> The creation of a materialized view is done with the “createView” operation
>> (with additional materialization flag) on the view catalog, creating a
>> combined MV storage object with an empty storage table.
>> One could entirely reuse the existing API for loading the materialized view
>> metadata as follows. When calling the “loadView” method of the ViewCatalog,
>> the catalog implementation fetches and caches the entire MV metadata object
>> in process and returns the view metadata part. When the “loadTable” method
>> of the TableCatalog is then called to obtain the storage table, it returns
>> the table part of the cached MV metadata object.
>>
>> Best wishes,
>>
>> Jan
>>
>> On 3/26/24 9:08 AM, Jan Kaul wrote:
>>
>> I think it makes sense if I use the "Description" section of your document
>> to clarify how I imagine a combined MV solution to look like. This would
>> simplify the discussion about pros and cons, because we can reference or
>> extend the description. I will try to find the time later today.
>>
>> Thanks,
>>
>> Jan
>>
>> On 3/25/24 4:39 PM, Walaa Eldin Moustafa wrote:
>>
>> Thanks Jan! I am not sure if you would like to make suggestions to revise
>> the options themselves or the current options pros and cons. In either case,
>> as mentioned earlier, we can do that on the doc and once we agree on the
>> options and their pros and cons we can move forward. How does that sound?
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Mon, Mar 25, 2024 at 7:45 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>
>>> I have the feeling that the current pros and cons from the summary target a
>>> version of the MV spec that wasn't really part of the discussion.
>>> The current arguments target a completely new specification for
>>> materialized views which, as we agreed, is out of scope. Instead of a
>>> completely new specification, the argument was made for an MV metadata
>>> object that embeds the View and the Table metadata, which was Option 6 in
>>> Jack's summary document. With that approach the "commitView" and
>>> "commitTable" operations don't have to be changed and only the "loadView"
>>> operation has to be adapted. Additionally, compaction and snapshot
>>> expiration can be reused for the embedded solution. With that in mind, the
>>> cons 2, 4, 5, 6 from the summary don't really apply.
>>>
>>> Furthermore, I think we should distinguish between pros and cons for the
>>> implementers and the users. Most of the pros (no new operations) for
>>> separate objects (option 1) are for the implementers, and most of the
>>> pros (single logical object, doesn't require 2 loads) for combined objects
>>> (option 3) are for the users. In my opinion, in the long run the design
>>> decisions should be focused more on the user preferences than the
>>> implementers.
>>> On 3/25/24 14:49, Benny Chow wrote:
>>>
>>> Hi Manu
>>>
>>> This is Walaa's Spark implementation for option 1:
>>> https://github.com/apache/iceberg/pull/9830/files/a9e1bee3b5bf5914e5330d3b195042aea33868c9
>>> There's no code for option 2 yet.
>>>
>>> Best
>>> Benny
>>>
>>> On Mon, Mar 25, 2024 at 12:37 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>
>>>> Thanks Walaa for the summary. It's unclear to me from the context which
>>>> is the reference implementation for option 1 and which is the reference
>>>> MV spec for option 2. I can find some links in the References section but
>>>> I am not sure which should be referred to respectively.
>>>>
>>>> On Mon, Mar 25, 2024 at 3:38 AM Walaa Eldin Moustafa
>>>> <wa.moust...@gmail.com> wrote:
>>>>>
>>>>> Thanks Himadri for the questions.
>>>>> At this point, our objective is to have
>>>>> a common understanding of both options and their pros and cons. The best
>>>>> way to achieve this is to iterate on the doc to discuss the details of
>>>>> each option or their pros and cons. We can always add more details or
>>>>> update the pros and cons. The main thing is to keep the options to two so
>>>>> that we keep the scope manageable.
>>>>>
>>>>> Once we have a common understanding, it will be easy to make a choice and
>>>>> move forward. Therefore, I would suggest reframing your questions as
>>>>> either adding suggestions to add more details to the options, questions
>>>>> on how either works, or discussions of their pros and cons on the doc.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
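
The "Combined metadata" option that Jan Kaul describes earlier in this thread (the view and table catalogs acting as "lenses" over one combined MV storage object) can be illustrated with a minimal Python sketch. This is a toy model and not the Iceberg API: the class names MVStorageObject, MVStore, ViewLens and TableLens, the dict-based metadata, and the shared in-process cache are all assumptions made for illustration. Only the operation names (createView, loadView, loadTable, CommitView, CommitTable) come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MVStorageObject:
    """Hypothetical combined object: view metadata plus storage-table metadata."""
    view_metadata: dict
    table_metadata: dict = field(default_factory=dict)  # empty storage table at creation


class MVStore:
    """Hypothetical catalog/storage layer: one identifier per materialized view."""

    def __init__(self) -> None:
        self._objects: Dict[str, MVStorageObject] = {}
        self.fetch_count = 0  # number of reads of the metadata storage

    def put(self, identifier: str, obj: MVStorageObject) -> None:
        self._objects[identifier] = obj

    def fetch(self, identifier: str) -> MVStorageObject:
        self.fetch_count += 1
        return self._objects[identifier]


class ViewLens:
    """View-catalog 'lens': reads and writes only the view part."""

    def __init__(self, store: MVStore, cache: Dict[str, MVStorageObject]) -> None:
        self.store = store
        self.cache = cache

    def create_view(self, identifier: str, view_metadata: dict, materialized: bool) -> None:
        if not materialized:
            raise ValueError("this sketch only models materialized views")
        self.store.put(identifier, MVStorageObject(view_metadata=dict(view_metadata)))

    def load_view(self, identifier: str) -> dict:
        # Fetch the whole MV object and cache it for a later load_table call.
        obj = self.store.fetch(identifier)
        self.cache[identifier] = obj
        return obj.view_metadata

    def commit_view(self, identifier: str, new_view_metadata: dict) -> None:
        obj = self.store.fetch(identifier)
        obj.view_metadata = dict(new_view_metadata)  # table part is left untouched
        self.cache[identifier] = obj


class TableLens:
    """Table-catalog 'lens': reads and writes only the storage-table part."""

    def __init__(self, store: MVStore, cache: Dict[str, MVStorageObject]) -> None:
        self.store = store
        self.cache = cache

    def load_table(self, identifier: str) -> dict:
        # Serve the table part from the cached MV object when available.
        obj = self.cache.get(identifier)
        if obj is None:
            obj = self.store.fetch(identifier)
            self.cache[identifier] = obj
        return obj.table_metadata

    def commit_table(self, identifier: str, new_table_metadata: dict) -> None:
        obj = self.store.fetch(identifier)
        obj.table_metadata = dict(new_table_metadata)  # view part is left untouched
        self.cache[identifier] = obj
```

In this sketch both lenses use the same identifier, and a load_view followed by a load_table reads the metadata storage only once, which is the "doesn't require 2 loads" property mentioned in the thread. Concurrency control and cache invalidation are deliberately left out.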