Hi Benny,

I think that identifiers only being defined for a certain representation is exactly what we want. Each representation can define their own identifiers that then map to an UUID. This way the "catalog_name" of the identifier for a "Spark" dialect can be different then for a "Dremio" dialect.

The important part is that we still have a list of identifiers for each representation that we can use with the catalog to obtain the state of the source tables.

Best wishes,

Jan

On 11.09.24 01:33, Benny Chow wrote:
Hi Walaa, I don't think the current view spec implicitly assumes a common catalog name between engines.  I tested this by not specifying the default-catalog and both engines could look up the correct table under the shared default-namespace even when each engine uses a different catalog name.

Hi Jan, I think the issue with putting the lineage as part of the representation is that that identifier only makes sense for that representation's engine.  In your example, the catalog aliased as "iceberg" in spark is going to have a different name in Dremio or Trino.

IMO, if we are to store a lineage for a view, it should consist of something engine agnostic like the table/view UUIDs.  This would be stored at the view version level and not the representation level.  I think as we get into more of these multi-engine, multi-catalog use cases for views, the Iceberg Catalog is going to need to do a better job at handling CRUD by UUID instead of engine specific identifiers. Another scenario we need to think through is a view that joins tables from two different catalogs.  How would we represent the lineage for that in an engine agnostic way?

Thanks
Benny



On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:

    Thanks Walaa and Benny for clarifying the problem. I think I have
    a better understanding now. Sorry for being a bit stubborn before.

    Wouldn't it make sense then to store the lineage as part of the
    representation:

    {

        "type": "sql",

        "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
    events\nGROUP BY 2",

        "dialect": "spark",

        "lineage": [{

            "identifier": { "catalog": "iceberg", "namespace":
    "public", "table": "events"},

            "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"

        }]

    }

    Best wishes,

    Jan

    On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
    Benny, thank you so much for performing the experiment. Glad that
    using UUIDs as keys in the state map makes more sense now.

    For the issue with the view spec being restrictive, I agree and I
    have raised the concern on the view spec PR last year [1]. I
    think there is some area of improvement here. At the least, if it
    is restrictive, it should be explicitly stated. I will start a
    thread on how to approach the view spec. We may need to get more
    insight on the view spec before finalizing the MV spec, because
    view spec will determine if we should proceed with one lineage
    (with the implicitly assumed common catalog name), or with
    multiple lineages (one per engine or catalog name).

    [1]
    https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619

    Thanks,
    Walaa.


    On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote:

        Hi Walaa

        I did some testing with two different engines (Spark and
        Dremio) against the same Nessie catalog and created the
        attached materialized view metadata.json.  I see your point
        now about the SQL identifiers being tightly coupled to the
        engines. In the metadata JSON, spark refers to the catalog as
        "SparkNessie", whereas Dremio refers to the catalog as
        "LocalNessie".  So, this means that the fully qualified view
        and table identifiers are engine specific and Dremio can't
        lookup a Spark identifier and vice versa.

        *So, I think it does make sense now for the refresh-state to
        key off the UUIDs and not use engine specific identifiers.
        *This also means that the materization consumer will have to
        fully expand the query tree and basically diff the UUID +
        latest snapshot ids against the refresh state.  Would it ever
        make sense for the Iceberg Catalog to expose a bulk lookup
        API by UUID?

        As a side note, it seems that for a materialized view to work
        with multiple engines, the default-catalog and
        default-namespace can't be used unless both engines use the
        same catalog name which seems pretty restrictive to me.

        Thanks for the great discussions
        Benny


        On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
        <wa.moust...@gmail.com> wrote:

            Jan, we definitely can store SQL identifiers of multiple
            representations in Approach 1.

            The takeaway is that SQL identifiers are highly coupled
            with engines, just like views. It makes sense to track
            both together for consistency.

            Thanks,
            Walaa.

            On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
            <jank...@mailbox.org.invalid>
            <mailto:jank...@mailbox.org.invalid> wrote:

                Walaa, thanks you for bringing up this use case. I
                think we need to keep in mind that we require
                identifiers to interface with the catalog. We cannot
                use UUIDs.

                Which means you also wouldn't be able to use Approach
                1 for your use case because you can't store the
                catalog names of multiple representations in the
                lineage. You would need to fallback to parsing the
                SQL for a particular representation and rebuilding
                the full query tree to obtain the identifiers.

                You could do the same for Approach 2. So I don't see
                why Approach 1 would yield any benefits.

                Thanks,

                Jan

                On 07.09.24 00:01, Steven Wu wrote:
                Benny, `default-catalog` is optional, while
                `default-namespace` is required.

                I will retract my comment on the `summary`. it
                indicates the engine that made the revision to the
                current view version. it doesn't really matter for
                multi-engine/representation support.

                On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
                <btc...@gmail.com> wrote:

                    Steven - Ideally, the lineage is engine agnostic
                    so I'd hope it wouldn't have to be under a
                    specific representation.
                    Walaa - That's a serious concern...  If the same
                    catalog is aliased differently by two different
                    engines, then the basic view spec seems broken
                    to me since "default-namespace" includes the
                    catalog alias and is outside of the SQL
                    representation.  Does that mean for a view to be
                    interoperable, we require different engines to
                    use the same catalog name?

                    Thanks

                    On Fri, Sep 6, 2024 at 1:29 PM Steven Wu
                    <stevenz...@gmail.com> wrote:

                        Walaa, thanks for bringing up the
                        interesting case of multiple representations
                        (for different engines), which definitely
                        requires more discussion from the community.

                        When I am looking at the view spec, I am
                        seeing some conflict. "summary" field seems
                        meant for only one engine, while
                        "representations" support multiple engines.

                        "summary" : {
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
                        : "Spark",
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
                        : "3.3.2"
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
                        : [ {
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
                        : "sql",
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
                        : "SELECT\n COUNT(1), CAST(event_ts AS
                        DATE)\nFROM events\nGROUP BY 2",
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
                        : "spark"
                        
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
                        ]

                        With multiple representations/engines, I
                        guess one engine will be responsible for the
                        storage table refresh and other engines are
                        read only. If we want to store the lineage
                        info in the view, it probably needs to be
                        part of the "representation" struct so that
                        each engine/representation stores its own
                        lineage info..
                        Who is to validate/ensure that the SQL
                        representation is actually semantically
                        identical (minus syntax differences across
                        engines)? I guess this responsibility is
                        left to the user who owns and manages the view.

Reply via email to