Re: [DISCUSS] Iceberg Materialzied Views

Jan Kaul Thu, 19 Sep 2024 01:27:41 -0700

Cool, I guess it's easier to resolve these kind of things when talkingin person.

I agree with your requirements and the conclusion to use a map from UUIDto snapshot-id/version-id as the refresh-state, as well as dropping thelineage in favor to just re-parsing the SQL query. This gets us aroundthe "catalog alias" issue.

And I'm also OK with every engine requiring their own representation touse the MV.

There is still the issue with the identifier of the storage table andits catalog_name. Should we use an "PartialIdentifier" with a namespaceand a name field, like so:


{

    namespace: ["bronze"],

    name: "lineitem"

}

And require the storage table to be in the same catalog as the MV itself?

Thanks,

Jan

On 19.09.24 00:50, Benny Chow wrote:

Steven and I met up yesterday at the Seattle Iceberg meetup and we gotto talking about the "catalog alias" issue. He described it as anannoying problem =p


I think there are some key requirements we need to support:

1. Different engines can produce and consume shared MVs with freshnessvalidation.2. We cannot force different engines to standardize on the alias theyuse for the catalog.3. We cannot force different SQL representations to exclude catalognames from table identifiers or not use fully qualified table names.4. MV SQL can join tables and views from multiple catalogs ->Inevitable with Nessie, Polaris, Unity, Tabular and others...

The producing engine has to save refresh state information to letconsuming engine know that table X is at what snapshot at the time ofmaterialization. The only way to identify this table across differentcatalog names is to use the cross catalog, globally unique UUID. Ithink our only option is to have the refresh state map UUID tosnapshot ids and view version ids.

Assuming the above is how we store the refresh state, how does theconsuming engine determine the current snapshot ids? The consumingengine will have to fully expand the query tree at which point it willhave the UUIDs as well as the latest snapshot ids/view versions. Thiscan then be diffed against the materialization refresh state todetermine freshness. There isn't a need to store the view lineageinformation to map from UUID to the consumer specific identifier sothat the consumer can then call back into the catalog with thatidentifier to get the latest state. The consuming engine might aswell just re-parse the SQL and expand the query.

Personally, I'm OK with requiring that an engine must have its own SQLrepresentation in order to use the MV. To me, being able to fulfillthe key requirements above is much more important.


Thanks
Benny

On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <jank...@mailbox.org.invalid>wrote:


    How about we make the /catalog_name field/ of the identifier
    optional? If the field is missing, it references a table/view in
    the same catalog. If it is present it has to be an engine agnostic
    catalog name. Shouldn't the catalog_names from the REST catalog
    spec be engine agnostic?

    I was wondering, is there no way to prescribe a catalog_name in
    Spark or Dremio? What do you do if you include two Nessie
    catalogs? They can't both be called LocalNessie.

    Thanks,

    Jan

    On 14.09.24 01:23, Benny Chow wrote:

    The main reason for putting the lineage into the view is so that
    "another" engine can enumerate out the tables in the view without
    needing to parse any SQL.  But, if we put the lineage under the
    SQL representation with engine specific catalog names, the
    "other" engine is not going to be able to use those identifiers
    to look up the tables.  The "other" engine can only lookup those
    identifiers using its engine specific catalog name.  It may be
    possible to enumerate the tables at the view version level ONLY
    if those identifiers don't include the catalog name.  However, if
    you have a view with a cross catalog join, then the tables coming
    from the other catalog have to be fully qualified.  But then the
    problem is that each engine will also alias the other catalog
    differently too.

    So, I think to summarize *multi-engine* view interoperability:

      * default-catalog can't be specified
      * default-namespace can be specified
      * View SQL can only references tables/views from the same catalog

    I think these are reasonable constraints for multi-engine use
    cases.  If reasonable, for MVs, then the storage table,
    refresh-state and lineage (at the view version level), could all
    be based on *engine agnostic* identifiers without the catalog
    name.  The MV and storage table would have to be in the same catalog.

    Thanks
    Benny



    On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul
    <jank...@mailbox.org.invalid>
    <mailto:jank...@mailbox.org.invalid> wrote:

        Hi,

        regarding our recent discussion on table identifiers with
        respect to different catalog_names with different query
        engines. We have the same problem when we want to reference
        the storage table from the common view. *If we include the
        catalog_name as part of the identifier, different query
        engines might not be able to load the storage table.
        *
        We could enforce that every storage table has to be part of
        the same catalog as the main view. This way an identifier
        without the catalog_name would be enough to point to the
        correct storage table.

        What are your thoughts on this?

        Best wishes,

        Jan

        On 11.09.24 16:05, Walaa Eldin Moustafa wrote:

        I think this type of discussion is exactly what motivates a
        clarification in the view spec so that we can resolve MV
        lineage. Will create separate thread for view spec
        clarification.

        Following up on Jan’s point, yes I agree in order to support
        catalog name, it should be at the representation level, but
        catalog name does not really depend on the “dialect” but
        rather on the “engine”; hence the discussion becomes a
        little more involved.

        Thanks,
        Walaa.

        On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
        <jank...@mailbox.org.invalid>
        <mailto:jank...@mailbox.org.invalid> wrote:

            Hi Benny,

            I think that identifiers only being defined for a
            certain representation is exactly what we want. Each
            representation can define their own identifiers that
            then map to an UUID. This way the "catalog_name" of the
            identifier for a "Spark" dialect can be different then
            for a "Dremio" dialect.

            The important part is that we still have a list of
            identifiers for each representation that we can use with
            the catalog to obtain the state of the source tables.

            Best wishes,

            Jan

            On 11.09.24 01:33, Benny Chow wrote:

            Hi Walaa, I don't think the current view spec
            implicitly assumes a common catalog name between
            engines.  I tested this by not specifying the
            default-catalog and both engines could look up the
            correct table under the shared default-namespace even
            when each engine uses a different catalog name.

            Hi Jan, I think the issue with putting the lineage as
            part of the representation is that that identifier only
            makes sense for that representation's engine.  In your
            example, the catalog aliased as "iceberg" in spark is
            going to have a different name in Dremio or Trino.

            IMO, if we are to store a lineage for a view, it should
            consist of something engine agnostic like the
            table/view UUIDs.  This would be stored at the view
            version level and not the representation level.  I
            think as we get into more of these multi-engine,
            multi-catalog use cases for views, the Iceberg Catalog
            is going to need to do a better job at handling CRUD by
            UUID instead of engine specific identifiers.  Another
            scenario we need to think through is a view that joins
            tables from two different catalogs.  How would we
            represent the lineage for that in an engine agnostic way?

            Thanks
            Benny



            On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
            <jank...@mailbox.org.invalid>
            <mailto:jank...@mailbox.org.invalid> wrote:

                Thanks Walaa and Benny for clarifying the problem.
                I think I have a better understanding now. Sorry
                for being a bit stubborn before.

                Wouldn't it make sense then to store the lineage as
                part of the representation:

                {

                    "type": "sql",

                    "sql": "SELECT\n COUNT(1), CAST(event_ts AS
                DATE)\nFROM events\nGROUP BY 2",

                    "dialect": "spark",

                    "lineage": [{

                        "identifier": { "catalog": "iceberg",
                "namespace": "public", "table": "events"},

                        "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"

                    }]

                }

                Best wishes,

                Jan

                On 09.09.24 11:59, Walaa Eldin Moustafa wrote:

                Benny, thank you so much for performing the
                experiment. Glad that using UUIDs as keys in the
                state map makes more sense now.

                For the issue with the view spec being
                restrictive, I agree and I have raised the concern
                on the view spec PR last year [1]. I think there
                is some area of improvement here. At the least, if
                it is restrictive, it should be explicitly stated.
                I will start a thread on how to approach the view
                spec. We may need to get more insight on the view
                spec before finalizing the MV spec, because view
                spec will determine if we should proceed with one
                lineage (with the implicitly assumed common
                catalog name), or with multiple lineages (one per
                engine or catalog name).

                [1]
                
https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619

                Thanks,
                Walaa.


                On Mon, Sep 9, 2024 at 3:28 AM Benny Chow
                <btc...@gmail.com> wrote:

                    Hi Walaa

                    I did some testing with two different engines
                    (Spark and Dremio) against the same Nessie
                    catalog and created the attached materialized
                    view metadata.json.  I see your point now
                    about the SQL identifiers being tightly
                    coupled to the engines.  In the metadata JSON,
                    spark refers to the catalog as "SparkNessie",
                    whereas Dremio refers to the catalog as
                    "LocalNessie".  So, this means that the fully
                    qualified view and table identifiers are
                    engine specific and Dremio can't lookup a
                    Spark identifier and vice versa.

                    *So, I think it does make sense now for the
                    refresh-state to key off the UUIDs and not use
                    engine specific identifiers. *This also means
                    that the materization consumer will have to
                    fully expand the query tree and basically diff
                    the UUID + latest snapshot ids against the
                    refresh state.  Would it ever make sense for
                    the Iceberg Catalog to expose a bulk lookup
                    API by UUID?

                    As a side note, it seems that for a
                    materialized view to work with multiple
                    engines, the default-catalog and
                    default-namespace can't be used unless both
                    engines use the same catalog name which seems
                    pretty restrictive to me.

                    Thanks for the great discussions
                    Benny


                    On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin
                    Moustafa <wa.moust...@gmail.com> wrote:

                        Jan, we definitely can store SQL
                        identifiers of multiple representations in
                        Approach 1.

                        The takeaway is that SQL identifiers are
                        highly coupled with engines, just like
                        views. It makes sense to track both
                        together for consistency.

                        Thanks,
                        Walaa.

                        On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
                        <jank...@mailbox.org.invalid>
                        <mailto:jank...@mailbox.org.invalid> wrote:

                            Walaa, thanks you for bringing up this
                            use case. I think we need to keep in
                            mind that we require identifiers to
                            interface with the catalog. We cannot
                            use UUIDs.

                            Which means you also wouldn't be able
                            to use Approach 1 for your use case
                            because you can't store the catalog
                            names of multiple representations in
                            the lineage. You would need to
                            fallback to parsing the SQL for a
                            particular representation and
                            rebuilding the full query tree to
                            obtain the identifiers.

                            You could do the same for Approach 2.
                            So I don't see why Approach 1 would
                            yield any benefits.

                            Thanks,

                            Jan

                            On 07.09.24 00:01, Steven Wu wrote:

                            Benny, `default-catalog` is optional,
                            while `default-namespace` is required.

                            I will retract my comment on the
                            `summary`. it indicates the engine
                            that made the revision to the current
                            view version. it doesn't really
                            matter for
                            multi-engine/representation support.

                            On Fri, Sep 6, 2024 at 2:49 PM Benny
                            Chow <btc...@gmail.com> wrote:

                                Steven - Ideally, the lineage is
                                engine agnostic so I'd hope it
                                wouldn't have to be under a
                                specific representation.
                                Walaa - That's a serious
                                concern...  If the same catalog
                                is aliased differently by two
                                different engines, then the basic
                                view spec seems broken to me
                                since "default-namespace"
                                includes the catalog alias and is
                                outside of the SQL
                                representation. Does that mean
                                for a view to be interoperable,
                                we require different engines to
                                use the same catalog name?

                                Thanks

                                On Fri, Sep 6, 2024 at 1:29 PM
                                Steven Wu <stevenz...@gmail.com>
                                wrote:

                                    Walaa, thanks for bringing up
                                    the interesting case of
                                    multiple representations (for
                                    different engines), which
                                    definitely requires more
                                    discussion from the community.

                                    When I am looking at the view
                                    spec, I am seeing some
                                    conflict. "summary" field
                                    seems meant for only one
                                    engine, while
                                    "representations" support
                                    multiple engines.

                                    "summary" : {
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
                                    : "Spark",
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
                                    : "3.3.2"
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
                                    : [ {
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
                                    : "sql",
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
                                    : "SELECT\n COUNT(1),
                                    CAST(event_ts AS DATE)\nFROM
                                    events\nGROUP BY 2",
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
                                    : "spark"
                                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
                                    ]

                                    With multiple
                                    representations/engines, I
                                    guess one engine will be
                                    responsible for the storage
                                    table refresh and other
                                    engines are read only. If we
                                    want to store the lineage
                                    info in the view, it probably
                                    needs to be part of the
                                    "representation" struct so
                                    that each
                                    engine/representation stores
                                    its own lineage info..
                                    Who is to validate/ensure
                                    that the SQL representation
                                    is actually semantically
                                    identical (minus syntax
                                    differences across engines)?
                                    I guess this responsibility
                                    is left to the user who owns
                                    and manages the view.

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to