Re: [DISCUSS] Iceberg Materialzied Views

Jan Kaul Fri, 13 Sep 2024 02:08:07 -0700

Hi,

regarding our recent discussion on table identifiers with respect todifferent catalog_names with different query engines. We have the sameproblem when we want to reference the storage table from the commonview. *If we include the catalog_name as part of the identifier,different query engines might not be able to load the storage table.

We could enforce that every storage table has to be part of the samecatalog as the main view. This way an identifier without thecatalog_name would be enough to point to the correct storage table.


What are your thoughts on this?

Best wishes,

Jan

On 11.09.24 16:05, Walaa Eldin Moustafa wrote:

I think this type of discussion is exactly what motivates aclarification in the view spec so that we can resolve MV lineage. Willcreate separate thread for view spec clarification.

Following up on Jan’s point, yes I agree in order to support catalogname, it should be at the representation level, but catalog name doesnot really depend on the “dialect” but rather on the “engine”; hencethe discussion becomes a little more involved.


Thanks,
Walaa.

On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <jank...@mailbox.org.invalid>wrote:


    Hi Benny,

    I think that identifiers only being defined for a certain
    representation is exactly what we want. Each representation can
    define their own identifiers that then map to an UUID. This way
    the "catalog_name" of the identifier for a "Spark" dialect can be
    different then for a "Dremio" dialect.

    The important part is that we still have a list of identifiers for
    each representation that we can use with the catalog to obtain the
    state of the source tables.

    Best wishes,

    Jan

    On 11.09.24 01:33, Benny Chow wrote:

    Hi Walaa, I don't think the current view spec implicitly assumes
    a common catalog name between engines.  I tested this by not
    specifying the default-catalog and both engines could look up the
    correct table under the shared default-namespace even when each
    engine uses a different catalog name.

    Hi Jan, I think the issue with putting the lineage as part of the
    representation is that that identifier only makes sense for that
    representation's engine.  In your example, the catalog aliased as
    "iceberg" in spark is going to have a different name in Dremio or
    Trino.

    IMO, if we are to store a lineage for a view, it should consist
    of something engine agnostic like the table/view UUIDs.  This
    would be stored at the view version level and not the
    representation level.  I think as we get into more of these
    multi-engine, multi-catalog use cases for views, the Iceberg
    Catalog is going to need to do a better job at handling CRUD by
    UUID instead of engine specific identifiers.  Another scenario we
    need to think through is a view that joins tables from two
    different catalogs.  How would we represent the lineage for that
    in an engine agnostic way?

    Thanks
    Benny



    On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
    <jank...@mailbox.org.invalid>
    <mailto:jank...@mailbox.org.invalid> wrote:

        Thanks Walaa and Benny for clarifying the problem. I think I
        have a better understanding now. Sorry for being a bit
        stubborn before.

        Wouldn't it make sense then to store the lineage as part of
        the representation:

        {

            "type": "sql",

            "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
        events\nGROUP BY 2",

            "dialect": "spark",

            "lineage": [{

                "identifier": { "catalog": "iceberg", "namespace":
        "public", "table": "events"},

                "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"

            }]

        }

        Best wishes,

        Jan

        On 09.09.24 11:59, Walaa Eldin Moustafa wrote:

        Benny, thank you so much for performing the experiment. Glad
        that using UUIDs as keys in the state map makes more sense now.

        For the issue with the view spec being restrictive, I agree
        and I have raised the concern on the view spec PR last year
        [1]. I think there is some area of improvement here. At the
        least, if it is restrictive, it should be explicitly stated.
        I will start a thread on how to approach the view spec. We
        may need to get more insight on the view spec before
        finalizing the MV spec, because view spec will determine if
        we should proceed with one lineage (with the implicitly
        assumed common catalog name), or with multiple lineages (one
        per engine or catalog name).

        [1]
        https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619

        Thanks,
        Walaa.


        On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com>
        wrote:

            Hi Walaa

            I did some testing with two different engines (Spark and
            Dremio) against the same Nessie catalog and created the
            attached materialized view metadata.json.  I see your
            point now about the SQL identifiers being tightly
            coupled to the engines.  In the metadata JSON, spark
            refers to the catalog as "SparkNessie", whereas Dremio
            refers to the catalog as "LocalNessie".  So, this means
            that the fully qualified view and table identifiers are
            engine specific and Dremio can't lookup a Spark
            identifier and vice versa.

            *So, I think it does make sense now for the
            refresh-state to key off the UUIDs and not use engine
            specific identifiers. *This also means that the
            materization consumer will have to fully expand the
            query tree and basically diff the UUID + latest snapshot
            ids against the refresh state.  Would it ever make sense
            for the Iceberg Catalog to expose a bulk lookup API by UUID?

            As a side note, it seems that for a materialized view to
            work with multiple engines, the default-catalog and
            default-namespace can't be used unless both engines use
            the same catalog name which seems pretty restrictive to me.

            Thanks for the great discussions
            Benny


            On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
            <wa.moust...@gmail.com> wrote:

                Jan, we definitely can store SQL identifiers of
                multiple representations in Approach 1.

                The takeaway is that SQL identifiers are highly
                coupled with engines, just like views. It makes
                sense to track both together for consistency.

                Thanks,
                Walaa.

                On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
                <jank...@mailbox.org.invalid>
                <mailto:jank...@mailbox.org.invalid> wrote:

                    Walaa, thanks you for bringing up this use case.
                    I think we need to keep in mind that we require
                    identifiers to interface with the catalog. We
                    cannot use UUIDs.

                    Which means you also wouldn't be able to use
                    Approach 1 for your use case because you can't
                    store the catalog names of multiple
                    representations in the lineage. You would need
                    to fallback to parsing the SQL for a particular
                    representation and rebuilding the full query
                    tree to obtain the identifiers.

                    You could do the same for Approach 2. So I don't
                    see why Approach 1 would yield any benefits.

                    Thanks,

                    Jan

                    On 07.09.24 00:01, Steven Wu wrote:

                    Benny, `default-catalog` is optional, while
                    `default-namespace` is required.

                    I will retract my comment on the `summary`. it
                    indicates the engine that made the revision to
                    the current view version. it doesn't really
                    matter for multi-engine/representation support.

                    On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
                    <btc...@gmail.com> wrote:

                        Steven - Ideally, the lineage is engine
                        agnostic so I'd hope it wouldn't have to be
                        under a specific representation.
                        Walaa - That's a serious concern... If the
                        same catalog is aliased differently by two
                        different engines, then the basic view spec
                        seems broken to me since
                        "default-namespace" includes the catalog
                        alias and is outside of the SQL
                        representation. Does that mean for a view
                        to be interoperable, we require different
                        engines to use the same catalog name?

                        Thanks

                        On Fri, Sep 6, 2024 at 1:29 PM Steven Wu
                        <stevenz...@gmail.com> wrote:

                            Walaa, thanks for bringing up the
                            interesting case of multiple
                            representations (for different
                            engines), which definitely requires
                            more discussion from the community.

                            When I am looking at the view spec, I
                            am seeing some conflict. "summary"
                            field seems meant for only one engine,
                            while "representations" support
                            multiple engines.

                            "summary" : {
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
                            : "Spark",
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
                            : "3.3.2"
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
                            : [ {
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
                            : "sql",
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
                            : "SELECT\n COUNT(1), CAST(event_ts AS
                            DATE)\nFROM events\nGROUP BY 2",
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
                            : "spark"
                            
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
                            ]

                            With multiple representations/engines,
                            I guess one engine will be responsible
                            for the storage table refresh and other
                            engines are read only. If we want to
                            store the lineage info in the view, it
                            probably needs to be part of the
                            "representation" struct so that each
                            engine/representation stores its own
                            lineage info..
                            Who is to validate/ensure that the SQL
                            representation is actually semantically
                            identical (minus syntax differences
                            across engines)? I guess this
                            responsibility is left to the user who
                            owns and manages the view.

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to