Thanks Walaa and Benny for clarifying the problem. I think I have a better understanding now. Sorry for being a bit stubborn before.

Wouldn't it make sense then to store the lineage as part of the representation:

{

    "type": "sql",

    "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",

    "dialect": "spark",

    "lineage": [{

        "identifier": { "catalog": "iceberg", "namespace": "public", "table": "events"},

        "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"

    }]

}

Best wishes,

Jan

On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
Benny, thank you so much for performing the experiment. Glad that using UUIDs as keys in the state map makes more sense now.

For the issue with the view spec being restrictive, I agree and I have raised the concern on the view spec PR last year [1]. I think there is some area of improvement here. At the least, if it is restrictive, it should be explicitly stated. I will start a thread on how to approach the view spec. We may need to get more insight on the view spec before finalizing the MV spec, because view spec will determine if we should proceed with one lineage (with the implicitly assumed common catalog name), or with multiple lineages (one per engine or catalog name).

[1] https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619

Thanks,
Walaa.


On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote:

    Hi Walaa

    I did some testing with two different engines (Spark and Dremio)
    against the same Nessie catalog and created the attached
    materialized view metadata.json.  I see your point now about the
    SQL identifiers being tightly coupled to the engines.  In the
    metadata JSON, spark refers to the catalog as "SparkNessie",
    whereas Dremio refers to the catalog as "LocalNessie".  So, this
    means that the fully qualified view and table identifiers are
    engine specific and Dremio can't lookup a Spark identifier and
    vice versa.

    *So, I think it does make sense now for the refresh-state to key
    off the UUIDs and not use engine specific identifiers. *This also
    means that the materization consumer will have to fully expand the
    query tree and basically diff the UUID + latest snapshot ids
    against the refresh state.  Would it ever make sense for the
    Iceberg Catalog to expose a bulk lookup API by UUID?

    As a side note, it seems that for a materialized view to work with
    multiple engines, the default-catalog and default-namespace can't
    be used unless both engines use the same catalog name which seems
    pretty restrictive to me.

    Thanks for the great discussions
    Benny


    On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
    <wa.moust...@gmail.com> wrote:

        Jan, we definitely can store SQL identifiers of multiple
        representations in Approach 1.

        The takeaway is that SQL identifiers are highly coupled with
        engines, just like views. It makes sense to track both
        together for consistency.

        Thanks,
        Walaa.

        On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
        <jank...@mailbox.org.invalid> wrote:

            Walaa, thanks you for bringing up this use case. I think
            we need to keep in mind that we require identifiers to
            interface with the catalog. We cannot use UUIDs.

            Which means you also wouldn't be able to use Approach 1
            for your use case because you can't store the catalog
            names of multiple representations in the lineage. You
            would need to fallback to parsing the SQL for a particular
            representation and rebuilding the full query tree to
            obtain the identifiers.

            You could do the same for Approach 2. So I don't see why
            Approach 1 would yield any benefits.

            Thanks,

            Jan

            On 07.09.24 00:01, Steven Wu wrote:
            Benny, `default-catalog` is optional, while
            `default-namespace` is required.

            I will retract my comment on the `summary`. it indicates
            the engine that made the revision to the current view
            version. it doesn't really matter for
            multi-engine/representation support.

            On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
            <btc...@gmail.com> wrote:

                Steven - Ideally, the lineage is engine agnostic so
                I'd hope it wouldn't have to be under a specific
                representation.
                Walaa - That's a serious concern... If the same
                catalog is aliased differently by two different
                engines, then the basic view spec seems broken to me
                since "default-namespace" includes the catalog alias
                and is outside of the SQL representation.  Does that
                mean for a view to be interoperable, we require
                different engines to use the same catalog name?

                Thanks

                On Fri, Sep 6, 2024 at 1:29 PM Steven Wu
                <stevenz...@gmail.com> wrote:

                    Walaa, thanks for bringing up the interesting
                    case of multiple representations (for different
                    engines), which definitely requires more
                    discussion from the community.

                    When I am looking at the view spec, I am seeing
                    some conflict. "summary" field seems meant for
                    only one engine, while "representations" support
                    multiple engines.

                    "summary" : {
                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
                    : "Spark",
                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
                    : "3.3.2"
                    <https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
                    : [ {
                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
                    : "sql",
                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
                    : "SELECT\n COUNT(1), CAST(event_ts AS
                    DATE)\nFROM events\nGROUP BY 2",
                    
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
                    : "spark"
                    <https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
                    ]

                    With multiple representations/engines, I guess
                    one engine will be responsible for the storage
                    table refresh and other engines are read only. If
                    we want to store the lineage info in the view, it
                    probably needs to be part of the "representation"
                    struct so that each engine/representation stores
                    its own lineage info..
                    Who is to validate/ensure that the SQL
                    representation is actually semantically identical
                    (minus syntax differences across engines)? I
                    guess this responsibility is left to the user who
                    owns and manages the view.

Reply via email to