Re: [DISCUSS] Iceberg Materialzied Views

Jan Kaul Fri, 20 Sep 2024 00:16:32 -0700

Hi Walaa,

It appears that you would like to maintain the lineage structure and notrevert to parsing the SQL to obtain identifiers.

Initially, one of the reasons for avoiding SQL parsing was to enableconsumers who don't understand the SQL dialect of any representation todetermine the freshness of the Materialized View (MV). However, with the"catalog alias" issue, having an identifier for some representation isinsufficient, as the /catalog_name/ is unlikely to work for theconsumer. Therefore, supporting consumers that don't use a query engineof any representation seems impossible.

Given this, parsing the SQL definition becomes a less significantdrawback, as the consumer must understand the dialect anyway. In fact,simply parsing the SQL definition seems like a more robust andstraightforward solution than using a lineage for every representation.I believe this is why Benny suggested reverting to SQL parsing, and Iagree with him.

Regarding the Storage table identifier: Its design as a/PartialIdentifier/ with only namespace and name fields was intentional,to avoid the /catalog_name/ issue.


Best regards,

Jan

On 19.09.24 23:16, Benny Chow wrote:

If Spark added the storage table identifier to the MV, I'm not surehow it could also add a full identifier to the Dremio representation. Spark doesn't know what name Dremio used for the catalog.

For the UX issue, I think Jan cleverly called it a "PartialIdentifier"and not a "FullIdentifier" to indicate that catalog name is not even aproperty of the identifier.

Requirement 3 is for the view's SQL. I'm not sure there is a verystrong use case to put the storage table into a different catalog thanthe view. If we had an engine agnostic solution for it, I'm all forit though...


Thanks
Benny

On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa<[email protected]> wrote:


    I think the solution for the storage identifier might be shared
    with the end state solution for the lineage. One could imagine a
    "full identifier" can be used for the storage table; however, it
    is "representation"-dependent (i.e., it changes according to
    which representation it is part of, or rather which engine uses it).

    Also, are we asking engines (or their Iceberg implementation) to
    throw an exception if the full storage table identifier was
    provided as part of the MV definition? Sounds like a not very
    ideal UX. Note that it also conflicts with the spirit of
    requirement #3.

    Thanks,
    Walaa.

    On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <[email protected]> wrote:

        Hi Jan

        "PartialIdentifier" without the catalog name sounds good to
        me.  The storage table and MV have to be in the same catalog. 
        That would be a good fifth requirement to add to the list.

        Thanks
        Benny

        On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul
        <[email protected]> wrote:

            Cool, I guess it's easier to resolve these kind of things
            when talking in person.

            I agree with your requirements and the conclusion to use a
            map from UUID to snapshot-id/version-id as the
            refresh-state, as well as dropping the lineage in favor to
            just re-parsing the SQL query. This gets us around the
            "catalog alias" issue.

            And I'm also OK with every engine requiring their own
            representation to use the MV.

            There is still the issue with the identifier of the
            storage table and its catalog_name. Should we use an
            "PartialIdentifier" with a namespace and a name field,
            like so:

            {

                namespace: ["bronze"],

                name: "lineitem"

            }

            And require the storage table to be in the same catalog as
            the MV itself?

            Thanks,

            Jan

            On 19.09.24 00:50, Benny Chow wrote:

            Steven and I met up yesterday at the Seattle Iceberg
            meetup and we got to talking about the "catalog alias"
            issue.  He described it as an annoying problem =p

            I think there are some key requirements we need to support:

            1. Different engines can produce and consume shared MVs
            with freshness validation.
            2. We cannot force different engines to standardize on
            the alias they use for the catalog.
            3. We cannot force different SQL representations to
            exclude catalog names from table identifiers or not use
            fully qualified table names.
            4. MV SQL can join tables and views from multiple
            catalogs -> Inevitable with Nessie, Polaris, Unity,
            Tabular and others...

            The producing engine has to save refresh state
            information to let consuming engine know that table X is
            at what snapshot at the time of materialization.  The
            only way to identify this table across different catalog
            names is to use the cross catalog, globally unique UUID. 
            I think our only option is to have the refresh state map
            UUID to snapshot ids and view version ids.

            Assuming the above is how we store the refresh state, how
            does the consuming engine determine the current snapshot
            ids?  The consuming engine will have to fully expand the
            query tree at which point it will have the UUIDs as well
            as the latest snapshot ids/view versions.  This can then
            be diffed against the materialization refresh state to
            determine freshness.  There isn't a need to store the
            view lineage information to map from UUID to the consumer
            specific identifier so that the consumer can then call
            back into the catalog with that identifier to get the
            latest state. The consuming engine might as well just
            re-parse the SQL and expand the query.

            Personally, I'm OK with requiring that an engine must
            have its own SQL representation in order to use the MV. 
            To me, being able to fulfill the key requirements above
            is much more important.

            Thanks
            Benny

            On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul
            <[email protected]>
            <mailto:[email protected]> wrote:

                How about we make the /catalog_name field/ of the
                identifier optional? If the field is missing, it
                references a table/view in the same catalog. If it is
                present it has to be an engine agnostic catalog name.
                Shouldn't the catalog_names from the REST catalog
                spec be engine agnostic?

                I was wondering, is there no way to prescribe a
                catalog_name in Spark or Dremio? What do you do if
                you include two Nessie catalogs? They can't both be
                called LocalNessie.

                Thanks,

                Jan

                On 14.09.24 01:23, Benny Chow wrote:

                The main reason for putting the lineage into the
                view is so that "another" engine can enumerate out
                the tables in the view without needing to parse any
                SQL.  But, if we put the lineage under the SQL
                representation with engine specific catalog names,
                the "other" engine is not going to be able to use
                those identifiers to look up the tables.  The
                "other" engine can only lookup those identifiers
                using its engine specific catalog name.  It may be
                possible to enumerate the tables at the view version
                level ONLY if those identifiers don't include the
                catalog name.  However, if you have a view with a
                cross catalog join, then the tables coming from the
                other catalog have to be fully qualified.  But then
                the problem is that each engine will also alias the
                other catalog differently too.

                So, I think to summarize *multi-engine* view
                interoperability:

                  * default-catalog can't be specified
                  * default-namespace can be specified
                  * View SQL can only references tables/views from
                    the same catalog

                I think these are reasonable constraints for
                multi-engine use cases.  If reasonable, for MVs,
                then the storage table, refresh-state and lineage
                (at the view version level), could all be based on
                *engine agnostic* identifiers without the catalog
                name.  The MV and storage table would have to be in
                the same catalog.

                Thanks
                Benny



                On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul
                <[email protected]>
                <mailto:[email protected]> wrote:

                    Hi,

                    regarding our recent discussion on table
                    identifiers with respect to different
                    catalog_names with different query engines. We
                    have the same problem when we want to reference
                    the storage table from the common view. *If we
                    include the catalog_name as part of the
                    identifier, different query engines might not be
                    able to load the storage table.
                    *
                    We could enforce that every storage table has to
                    be part of the same catalog as the main view.
                    This way an identifier without the catalog_name
                    would be enough to point to the correct storage
                    table.

                    What are your thoughts on this?

                    Best wishes,

                    Jan

                    On 11.09.24 16:05, Walaa Eldin Moustafa wrote:

                    I think this type of discussion is exactly what
                    motivates a clarification in the view spec so
                    that we can resolve MV lineage. Will create
                    separate thread for view spec clarification.

                    Following up on Jan’s point, yes I agree in
                    order to support catalog name, it should be at
                    the representation level, but catalog name does
                    not really depend on the “dialect” but rather
                    on the “engine”; hence the discussion becomes a
                    little more involved.

                    Thanks,
                    Walaa.

                    On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
                    <[email protected]>
                    <mailto:[email protected]> wrote:

                        Hi Benny,

                        I think that identifiers only being defined
                        for a certain representation is exactly
                        what we want. Each representation can
                        define their own identifiers that then map
                        to an UUID. This way the "catalog_name" of
                        the identifier for a "Spark" dialect can be
                        different then for a "Dremio" dialect.

                        The important part is that we still have a
                        list of identifiers for each representation
                        that we can use with the catalog to obtain
                        the state of the source tables.

                        Best wishes,

                        Jan

                        On 11.09.24 01:33, Benny Chow wrote:

                        Hi Walaa, I don't think the current view
                        spec implicitly assumes a common catalog
                        name between engines.  I tested this by
                        not specifying the default-catalog and
                        both engines could look up the correct
                        table under the shared default-namespace
                        even when each engine uses a different
                        catalog name.

                        Hi Jan, I think the issue with putting the
                        lineage as part of the representation is
                        that that identifier only makes sense for
                        that representation's engine.  In your
                        example, the catalog aliased as "iceberg"
                        in spark is going to have a different name
                        in Dremio or Trino.

                        IMO, if we are to store a lineage for a
                        view, it should consist of something
                        engine agnostic like the table/view UUIDs.
                        This would be stored at the view version
                        level and not the representation level.  I
                        think as we get into more of these
                        multi-engine, multi-catalog use cases for
                        views, the Iceberg Catalog is going to
                        need to do a better job at handling CRUD
                        by UUID instead of engine specific
                        identifiers. Another scenario we need to
                        think through is a view that joins tables
                        from two different catalogs.  How would we
                        represent the lineage for that in an
                        engine agnostic way?

                        Thanks
                        Benny



                        On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
                        <[email protected]>
                        <mailto:[email protected]> wrote:

                            Thanks Walaa and Benny for clarifying
                            the problem. I think I have a better
                            understanding now. Sorry for being a
                            bit stubborn before.

                            Wouldn't it make sense then to store
                            the lineage as part of the representation:

                            {

                                "type": "sql",

                                "sql": "SELECT\n COUNT(1),
                            CAST(event_ts AS DATE)\nFROM
                            events\nGROUP BY 2",

                            "dialect": "spark",

                            "lineage": [{

                            "identifier": { "catalog": "iceberg",
                            "namespace": "public", "table":
                            "events"},

                            "uuid":
                            "fa6506c3-7681-40c8-86dc-e36561f83385"

                                }]

                            }

                            Best wishes,

                            Jan

                            On 09.09.24 11:59, Walaa Eldin
                            Moustafa wrote:

                            Benny, thank you so much for
                            performing the experiment. Glad that
                            using UUIDs as keys in the state map
                            makes more sense now.

                            For the issue with the view spec
                            being restrictive, I agree and I have
                            raised the concern on the view spec
                            PR last year [1]. I think there is
                            some area of improvement here. At the
                            least, if it is restrictive, it
                            should be explicitly stated. I will
                            start a thread on how to approach the
                            view spec. We may need to get more
                            insight on the view spec before
                            finalizing the MV spec, because view
                            spec will determine if we should
                            proceed with one lineage (with the
                            implicitly assumed common catalog
                            name), or with multiple lineages (one
                            per engine or catalog name).

                            [1]
                            
https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619

                            Thanks,
                            Walaa.


                            On Mon, Sep 9, 2024 at 3:28 AM Benny
                            Chow <[email protected]> wrote:

                                Hi Walaa

                                I did some testing with two
                                different engines (Spark and
                                Dremio) against the same Nessie
                                catalog and created the attached
                                materialized view metadata.json.
                                I see your point now about the
                                SQL identifiers being tightly
                                coupled to the engines.  In the
                                metadata JSON, spark refers to
                                the catalog as "SparkNessie",
                                whereas Dremio refers to the
                                catalog as "LocalNessie". So,
                                this means that the fully
                                qualified view and table
                                identifiers are engine specific
                                and Dremio can't lookup a Spark
                                identifier and vice versa.

                                *So, I think it does make sense
                                now for the refresh-state to key
                                off the UUIDs and not use engine
                                specific identifiers. *This also
                                means that the materization
                                consumer will have to fully
                                expand the query tree and
                                basically diff the UUID + latest
                                snapshot ids against the refresh
                                state.  Would it ever make sense
                                for the Iceberg Catalog to expose
                                a bulk lookup API by UUID?

                                As a side note, it seems that for
                                a materialized view to work with
                                multiple engines, the
                                default-catalog and
                                default-namespace can't be used
                                unless both engines use the same
                                catalog name which seems pretty
                                restrictive to me.

                                Thanks for the great discussions
                                Benny


                                On Sat, Sep 7, 2024 at 2:49 AM
                                Walaa Eldin Moustafa
                                <[email protected]> wrote:

                                    Jan, we definitely can store
                                    SQL identifiers of multiple
                                    representations in Approach 1.

                                    The takeaway is that SQL
                                    identifiers are highly
                                    coupled with engines, just
                                    like views. It makes sense to
                                    track both together for
                                    consistency.

                                    Thanks,
                                    Walaa.

                                    On Sat, Sep 7, 2024 at
                                    8:15 AM Jan Kaul
                                    <[email protected]>
                                    <mailto:[email protected]>
                                    wrote:

                                        Walaa, thanks you for
                                        bringing up this use
                                        case. I think we need to
                                        keep in mind that we
                                        require identifiers to
                                        interface with the
                                        catalog. We cannot use
                                        UUIDs.

                                        Which means you also
                                        wouldn't be able to use
                                        Approach 1 for your use
                                        case because you can't
                                        store the catalog names
                                        of multiple
                                        representations in the
                                        lineage. You would need
                                        to fallback to parsing
                                        the SQL for a particular
                                        representation and
                                        rebuilding the full query
                                        tree to obtain the
                                        identifiers.

                                        You could do the same for
                                        Approach 2. So I don't
                                        see why Approach 1 would
                                        yield any benefits.

                                        Thanks,

                                        Jan

                                        On 07.09.24 00:01, Steven
                                        Wu wrote:

                                        Benny, `default-catalog`
                                        is optional, while
                                        `default-namespace` is
                                        required.

                                        I will retract my
                                        comment on the
                                        `summary`. it indicates
                                        the engine that made the
                                        revision to the current
                                        view version. it doesn't
                                        really matter for
                                        multi-engine/representation
                                        support.

                                        On Fri, Sep 6, 2024 at
                                        2:49 PM Benny Chow
                                        <[email protected]> wrote:

                                            Steven - Ideally,
                                            the lineage is
                                            engine agnostic so
                                            I'd hope it wouldn't
                                            have to be under a
                                            specific representation.
                                            Walaa - That's a
                                            serious concern... 
                                            If the same catalog
                                            is aliased
                                            differently by two
                                            different engines,
                                            then the basic view
                                            spec seems broken to
                                            me since
                                            "default-namespace"
                                            includes the catalog
                                            alias and is outside
                                            of the SQL
                                            representation. Does
                                            that mean for a view
                                            to be interoperable,
                                            we require different
                                            engines to use the
                                            same catalog name?

                                            Thanks

                                            On Fri, Sep 6, 2024
                                            at 1:29 PM Steven Wu
                                            <[email protected]>
                                            wrote:

                                                Walaa, thanks
                                                for bringing up
                                                the interesting
                                                case of multiple
                                                representations
                                                (for different
                                                engines), which
                                                definitely
                                                requires more
                                                discussion from
                                                the community.

                                                When I am
                                                looking at the
                                                view spec, I am
                                                seeing some
                                                conflict.
                                                "summary" field
                                                seems meant for
                                                only one engine,
                                                while
                                                "representations"
                                                support multiple
                                                engines.

                                                "summary" : {
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
                                                : "Spark",
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
                                                : "3.3.2"
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
                                                : [ {
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
                                                : "sql",
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
                                                : "SELECT\n
                                                COUNT(1),
                                                CAST(event_ts AS
                                                DATE)\nFROM
                                                events\nGROUP BY
                                                2",
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
                                                : "spark"
                                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
                                                ]

                                                With multiple
                                                representations/engines,
                                                I guess one
                                                engine will be
                                                responsible for
                                                the storage
                                                table refresh
                                                and other
                                                engines are read
                                                only. If we want
                                                to store the
                                                lineage info in
                                                the view, it
                                                probably
                                                needs to be part
                                                of the
                                                "representation"
                                                struct so that
                                                each
                                                engine/representation
                                                stores its own
                                                lineage info..
                                                Who is to
                                                validate/ensure
                                                that the SQL
                                                representation
                                                is actually
                                                semantically
                                                identical (minus
                                                syntax
                                                differences
                                                across engines)?
                                                I guess this
                                                responsibility
                                                is left to the
                                                user who owns
                                                and manages the
                                                view.

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to