Re: Materialized view integration with REST spec

Jan Kaul Tue, 26 Mar 2024 06:27:16 -0700

I've added a description to the "Combined metadata" Option of Walaa's document. I'm also adding it here:

This option treats the underlying view and storage table as a combined catalog object. The operation of this combined approach can be best demonstrated by looking at the different layers of the Iceberg implementation. In the top layer is the Iceberg *library* that interacts with a particular Iceberg *catalog*. The catalog handles the access to the metadata *storage*.

This option uses a combined storage object to store view and table metadata related to the materialized view. To avoid the definition of an entirely new metadata format, the storage object is composed of the view and table metadata. Additionally the combined storage object has a *single identifier* in the catalogs. The Iceberg library treats the materialized view as a separate view and a storage table object, it is only at the catalog and storage layer that the materialized view is treated as a single entity.

To reuse most of the existing TableCatalog, ViewCatalog and their operations, the table and view catalog can be thought of as “filters” (lenses <https://medium.com/javascript-scene/lenses-b85976cb0534>), that allow the interaction only with the corresponding part of the MV storage object. Performing a “CommitView” operation on the view catalog will only affect the view metadata part of the combined MV storage object. And similarly, performing a “CommitTable” operation on the table catalog will only affect the table metadata part of the combined MV storage object. Both catalogs use the same identifier for operations on the materialized view.

The creation of a materialized view is done with the “createView” operation (with additional materialization flag) on the view catalog, creating a combined MV storage object with an empty storage table.

One could entirely reuse the existing API for loading the materialized view metadata as follows. When calling the “loadView” method of the ViewCatalog, the catalog implementation fetches and caches the entire MV metadata object in process and returns the view metadata part. When the “loadTable” method of the TableCatalog is then called to obtain the storage table, it returns the table part of the cached MV metadata object.
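
The description above can be illustrated with a minimal sketch, written in Python rather than against Iceberg's Java API. Everything in it is an assumption made for illustration: the names MVStorageObject, CombinedStore, the snake_case catalog methods, the shared in-process cache dictionary, and the fetch counter are hypothetical and not part of Iceberg. It only shows how two catalogs could act as "lenses" over one combined object that is stored under a single identifier.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class MVStorageObject:
    """Hypothetical combined object: view part plus storage-table part."""
    view_metadata: dict
    table_metadata: dict


class CombinedStore:
    """Hypothetical storage layer: one identifier maps to one combined object."""

    def __init__(self) -> None:
        self._objects: Dict[str, MVStorageObject] = {}
        self.fetch_count = 0  # counts round trips to storage

    def fetch(self, identifier: str) -> MVStorageObject:
        self.fetch_count += 1
        return self._objects[identifier]

    def put(self, identifier: str, obj: MVStorageObject) -> None:
        self._objects[identifier] = obj


class _Lens:
    """Shared plumbing: both catalogs use the same store and in-process cache."""

    def __init__(self, store: CombinedStore, cache: Dict[str, MVStorageObject]) -> None:
        self._store = store
        self._cache = cache

    def _update(self, identifier: str, **changes: dict) -> None:
        # Replace only the named part; the other part is carried over unchanged.
        updated = replace(self._store.fetch(identifier), **changes)
        self._store.put(identifier, updated)
        self._cache[identifier] = updated


class ViewCatalog(_Lens):
    def create_view(self, identifier: str, view_metadata: dict, materialized: bool = True) -> None:
        if not materialized:
            raise NotImplementedError("plain views are outside this sketch")
        # A new materialized view starts with an empty storage table.
        self._store.put(identifier, MVStorageObject(view_metadata, {}))

    def load_view(self, identifier: str) -> dict:
        obj = self._store.fetch(identifier)  # fetch the whole MV object once
        self._cache[identifier] = obj        # keep it for a later load_table
        return obj.view_metadata

    def commit_view(self, identifier: str, view_metadata: dict) -> None:
        self._update(identifier, view_metadata=view_metadata)


class TableCatalog(_Lens):
    def load_table(self, identifier: str) -> dict:
        obj = self._cache.get(identifier)
        if obj is None:
            obj = self._store.fetch(identifier)
        return obj.table_metadata

    def commit_table(self, identifier: str, table_metadata: dict) -> None:
        self._update(identifier, table_metadata=table_metadata)


store, cache = CombinedStore(), {}
views, tables = ViewCatalog(store, cache), TableCatalog(store, cache)
views.create_view("db.mv", {"sql": "SELECT 1"}, materialized=True)
view_part = views.load_view("db.mv")     # one fetch from storage
table_part = tables.load_table("db.mv")  # served from the cached MV object
```

In this sketch a commit through either catalog rewrites the whole combined object but changes only its own part, which is the "lens" behaviour described above. Cache invalidation across processes and concurrent commits are deliberately left out.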


Best wishes,

Jan

On 3/26/24 9:08 AM, Jan Kaul wrote:

I think it makes sense if I use the "Description" section of your document to clarify what I imagine a combined MV solution would look like. This would simplify the discussion about pros and cons, because we can reference or extend the description. I will try to find the time later today.


Thanks,

Jan

On 3/25/24 4:39 PM, Walaa Eldin Moustafa wrote:

Thanks Jan! I am not sure if you would like to make suggestions to revise the options themselves or the current options' pros and cons. In either case, as mentioned earlier, we can do that on the doc and once we agree on the options and their pros and cons we can move forward. How does that sound?


Thanks,
Walaa.

On Mon, Mar 25, 2024 at 7:45 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:


    I have the feeling that the current pros and cons from the
    summary target a version of the MV spec that wasn't really part
    of the discussion. The current arguments target a completely new
    specification for materialized views which, as we agreed, is out
    of scope. Instead of a completely new specification, the argument
    was made for a MV metadata object that embeds the View and the
    Table metadata, which was Option 6
    
<https://docs.google.com/spreadsheets/d/1a0tlyh8f2ft2SepE7H3bgoY2A0q5IILgzAsJMnwjTBs/edit#gid=0&range=G3>
    in Jack's summary document. With that approach the "commitView"
    and "commitTable" operations don't have to be changed and only
    the "loadView" operation has to be adapted. Additionally,
    compaction and snapshot expiration can be reused for the embedded
    solution. With that in mind, the cons 2, 4, 5, 6 from the summary
    don't really apply.

    Furthermore, I think we should distinguish between pros and cons
    for the implementers and the users, because most of the pros (no
    new operations) for separate objects (option 1) are for the
    implementers and most of the pros (single logical object, doesn't
    require 2 loads) for combined objects (option 3) are for the
    users. In my opinion, in the long run the design decisions should
    be focused more on user preferences than on those of the implementers.

    On 3/25/24 14:49, Benny Chow wrote:

Hi Manu

This is Walaa's Spark implementation for option 1:

https://github.com/apache/iceberg/pull/9830/files/a9e1bee3b5bf5914e5330d3b195042aea33868c9

There's no code for option 2 yet.

Best
Benny

On Mon, Mar 25, 2024 at 12:37 AM Manu Zhang
<owenzhang1...@gmail.com> wrote:

Thanks Walaa for the summary. It's unclear to me from the context
which is the reference implementation for option 1 and which is the
reference MV spec for option 2. I can find some links in the
References section but am not sure which link corresponds to which
option.

On Mon, Mar 25, 2024 at 3:38 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:

Thanks Himadri for the questions. At this point, our
objective is to have a common understanding of both
options and their pros and cons. The best way to achieve
this is to iterate on the doc to discuss the details of
each option or their pros and cons. We can always add
more details or update the pros and cons. The main thing
is to keep the options to two so that we keep the scope
manageable.

Once we have a common understanding, it will be easy to
make a choice and move forward. Therefore, I would
suggest reframing your questions on the doc as
suggestions for adding more details to the options,
questions on how either works, or discussions of their
pros and cons.

Thanks,
Walaa.
