Re: [DISCUSS] Iceberg Materialzied Views

Jan Kaul Sat, 14 Sep 2024 02:01:04 -0700

How about we make the /catalog_name field/ of the identifier optional?If the field is missing, it references a table/view in the same catalog.If it is present it has to be an engine agnostic catalog name. Shouldn'tthe catalog_names from the REST catalog spec be engine agnostic?

I was wondering, is there no way to prescribe a catalog_name in Spark orDremio? What do you do if you include two Nessie catalogs? They can'tboth be called LocalNessie.


Thanks,

Jan

On 14.09.24 01:23, Benny Chow wrote:

The main reason for putting the lineage into the view is so that"another" engine can enumerate out the tables in the view withoutneeding to parse any SQL. But, if we put the lineage under the SQLrepresentation with engine specific catalog names, the "other" engineis not going to be able to use those identifiers to look up thetables. The "other" engine can only lookup those identifiers usingits engine specific catalog name. It may be possible to enumerate thetables at the view version level ONLY if those identifiers don'tinclude the catalog name. However, if you have a view with a crosscatalog join, then the tables coming from the other catalog have to befully qualified. But then the problem is that each engine will alsoalias the other catalog differently too.


So, I think to summarize *multi-engine* view interoperability:

  * default-catalog can't be specified
  * default-namespace can be specified
  * View SQL can only references tables/views from the same catalog

I think these are reasonable constraints for multi-engine use cases. If reasonable, for MVs, then the storage table, refresh-state andlineage (at the view version level), could all be based on *engineagnostic* identifiers without the catalog name. The MV and storagetable would have to be in the same catalog.


Thanks
Benny

On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <jank...@mailbox.org.invalid>wrote:


    Hi,

    regarding our recent discussion on table identifiers with respect
    to different catalog_names with different query engines. We have
    the same problem when we want to reference the storage table from
    the common view. *If we include the catalog_name as part of the
    identifier, different query engines might not be able to load the
    storage table.
    *
    We could enforce that every storage table has to be part of the
    same catalog as the main view. This way an identifier without the
    catalog_name would be enough to point to the correct storage table.

    What are your thoughts on this?

    Best wishes,

    Jan

    On 11.09.24 16:05, Walaa Eldin Moustafa wrote:

    I think this type of discussion is exactly what motivates a
    clarification in the view spec so that we can resolve MV lineage.
    Will create separate thread for view spec clarification.

    Following up on Jan’s point, yes I agree in order to support
    catalog name, it should be at the representation level, but
    catalog name does not really depend on the “dialect” but rather
    on the “engine”; hence the discussion becomes a little more involved.

    Thanks,
    Walaa.

    On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
    <jank...@mailbox.org.invalid>
    <mailto:jank...@mailbox.org.invalid> wrote:

        Hi Benny,

        I think that identifiers only being defined for a certain
        representation is exactly what we want. Each representation
        can define their own identifiers that then map to an UUID.
        This way the "catalog_name" of the identifier for a "Spark"
        dialect can be different then for a "Dremio" dialect.

        The important part is that we still have a list of
        identifiers for each representation that we can use with the
        catalog to obtain the state of the source tables.

        Best wishes,

        Jan

        On 11.09.24 01:33, Benny Chow wrote:

        Hi Walaa, I don't think the current view spec implicitly
        assumes a common catalog name between engines.  I tested
        this by not specifying the default-catalog and both engines
        could look up the correct table under the shared
        default-namespace even when each engine uses a different
        catalog name.

        Hi Jan, I think the issue with putting the lineage as part
        of the representation is that that identifier only makes
        sense for that representation's engine.  In your example,
        the catalog aliased as "iceberg" in spark is going to have a
        different name in Dremio or Trino.

        IMO, if we are to store a lineage for a view, it should
        consist of something engine agnostic like the table/view
        UUIDs.  This would be stored at the view version level and
        not the representation level.  I think as we get into more
        of these multi-engine, multi-catalog use cases for views,
        the Iceberg Catalog is going to need to do a better job at
        handling CRUD by UUID instead of engine specific
        identifiers.  Another scenario we need to think through is a
        view that joins tables from two different catalogs.  How
        would we represent the lineage for that in an engine
        agnostic way?

        Thanks
        Benny



        On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
        <jank...@mailbox.org.invalid>
        <mailto:jank...@mailbox.org.invalid> wrote:

            Thanks Walaa and Benny for clarifying the problem. I
            think I have a better understanding now. Sorry for being
            a bit stubborn before.

            Wouldn't it make sense then to store the lineage as part
            of the representation:

            {

                "type": "sql",

                "sql": "SELECT\n COUNT(1), CAST(event_ts AS
            DATE)\nFROM events\nGROUP BY 2",

                "dialect": "spark",

                "lineage": [{

                    "identifier": { "catalog": "iceberg",
            "namespace": "public", "table": "events"},

                    "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"

                }]

            }

            Best wishes,

            Jan

            On 09.09.24 11:59, Walaa Eldin Moustafa wrote:

            Benny, thank you so much for performing the experiment.
            Glad that using UUIDs as keys in the state map makes
            more sense now.

            For the issue with the view spec being restrictive, I
            agree and I have raised the concern on the view spec PR
            last year [1]. I think there is some area of
            improvement here. At the least, if it is restrictive,
            it should be explicitly stated. I will start a thread
            on how to approach the view spec. We may need to get
            more insight on the view spec before finalizing the MV
            spec, because view spec will determine if we should
            proceed with one lineage (with the implicitly assumed
            common catalog name), or with multiple lineages (one
            per engine or catalog name).

            [1]
            https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619

            Thanks,
            Walaa.


            On Mon, Sep 9, 2024 at 3:28 AM Benny Chow
            <btc...@gmail.com> wrote:

                Hi Walaa

                I did some testing with two different engines
                (Spark and Dremio) against the same Nessie catalog
                and created the attached materialized view
                metadata.json.  I see your point now about the SQL
                identifiers being tightly coupled to the engines. 
                In the metadata JSON, spark refers to the catalog
                as "SparkNessie", whereas Dremio refers to the
                catalog as "LocalNessie".  So, this means that the
                fully qualified view and table identifiers are
                engine specific and Dremio can't lookup a Spark
                identifier and vice versa.

                *So, I think it does make sense now for the
                refresh-state to key off the UUIDs and not use
                engine specific identifiers. *This also means that
                the materization consumer will have to fully expand
                the query tree and basically diff the UUID + latest
                snapshot ids against the refresh state. Would it
                ever make sense for the Iceberg Catalog to expose a
                bulk lookup API by UUID?

                As a side note, it seems that for a materialized
                view to work with multiple engines, the
                default-catalog and default-namespace can't be used
                unless both engines use the same catalog name which
                seems pretty restrictive to me.

                Thanks for the great discussions
                Benny


                On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
                <wa.moust...@gmail.com> wrote:

                    Jan, we definitely can store SQL identifiers of
                    multiple representations in Approach 1.

                    The takeaway is that SQL identifiers are highly
                    coupled with engines, just like views. It makes
                    sense to track both together for consistency.

                    Thanks,
                    Walaa.

                    On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
                    <jank...@mailbox.org.invalid>
                    <mailto:jank...@mailbox.org.invalid> wrote:

                        Walaa, thanks you for bringing up this use
                        case. I think we need to keep in mind that
                        we require identifiers to interface with
                        the catalog. We cannot use UUIDs.

                        Which means you also wouldn't be able to
                        use Approach 1 for your use case because
                        you can't store the catalog names of
                        multiple representations in the lineage.
                        You would need to fallback to parsing the
                        SQL for a particular representation and
                        rebuilding the full query tree to obtain
                        the identifiers.

                        You could do the same for Approach 2. So I
                        don't see why Approach 1 would yield any
                        benefits.

                        Thanks,

                        Jan

                        On 07.09.24 00:01, Steven Wu wrote:

                        Benny, `default-catalog` is optional,
                        while `default-namespace` is required.

                        I will retract my comment on the
                        `summary`. it indicates the engine that
                        made the revision to the current view
                        version. it doesn't really matter for
                        multi-engine/representation support.

                        On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
                        <btc...@gmail.com> wrote:

                            Steven - Ideally, the lineage is
                            engine agnostic so I'd hope it
                            wouldn't have to be under a specific
                            representation.
                            Walaa - That's a serious concern... 
                            If the same catalog is aliased
                            differently by two different engines,
                            then the basic view spec seems broken
                            to me since "default-namespace"
                            includes the catalog alias and is
                            outside of the SQL representation.
                            Does that mean for a view to be
                            interoperable, we require different
                            engines to use the same catalog name?

                            Thanks

                            On Fri, Sep 6, 2024 at 1:29 PM Steven
                            Wu <stevenz...@gmail.com> wrote:

                                Walaa, thanks for bringing up the
                                interesting case of multiple
                                representations (for different
                                engines), which definitely
                                requires more discussion from the
                                community.

                                When I am looking at the view
                                spec, I am seeing some conflict.
                                "summary" field seems meant for
                                only one engine, while
                                "representations" support multiple
                                engines.

                                "summary" : {
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
                                : "Spark",
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
                                : "3.3.2"
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
                                : [ {
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
                                : "sql",
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
                                : "SELECT\n COUNT(1),
                                CAST(event_ts AS DATE)\nFROM
                                events\nGROUP BY 2",
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
                                : "spark"
                                
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
                                ]

                                With multiple
                                representations/engines, I guess
                                one engine will be responsible for
                                the storage table refresh and
                                other engines are read only. If we
                                want to store the lineage info in
                                the view, it probably needs to be
                                part of the "representation"
                                struct so that each
                                engine/representation stores its
                                own lineage info..
                                Who is to validate/ensure that the
                                SQL representation is actually
                                semantically identical (minus
                                syntax differences across
                                engines)? I guess this
                                responsibility is left to the user
                                who owns and manages the view.

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to