Thanks Walaa and Benny for clarifying the problem. I think I have a
better understanding now. Sorry for being a bit stubborn before.
Wouldn't it make sense then to store the lineage as part of the
representation:
{
"type": "sql",
"sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
events\nGROUP BY 2",
"dialect": "spark",
"lineage": [{
"identifier": { "catalog": "iceberg", "namespace": "public",
"table": "events"},
"uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
}]
}
Best wishes,
Jan
On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
Benny, thank you so much for performing the experiment. Glad that
using UUIDs as keys in the state map makes more sense now.
For the issue with the view spec being restrictive, I agree and I have
raised the concern on the view spec PR last year [1]. I think there is
some area of improvement here. At the least, if it is restrictive, it
should be explicitly stated. I will start a thread on how to approach
the view spec. We may need to get more insight on the view spec before
finalizing the MV spec, because view spec will determine if we should
proceed with one lineage (with the implicitly assumed common catalog
name), or with multiple lineages (one per engine or catalog name).
[1] https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
Thanks,
Walaa.
On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote:
Hi Walaa
I did some testing with two different engines (Spark and Dremio)
against the same Nessie catalog and created the attached
materialized view metadata.json. I see your point now about the
SQL identifiers being tightly coupled to the engines. In the
metadata JSON, spark refers to the catalog as "SparkNessie",
whereas Dremio refers to the catalog as "LocalNessie". So, this
means that the fully qualified view and table identifiers are
engine specific and Dremio can't lookup a Spark identifier and
vice versa.
*So, I think it does make sense now for the refresh-state to key
off the UUIDs and not use engine specific identifiers. *This also
means that the materization consumer will have to fully expand the
query tree and basically diff the UUID + latest snapshot ids
against the refresh state. Would it ever make sense for the
Iceberg Catalog to expose a bulk lookup API by UUID?
As a side note, it seems that for a materialized view to work with
multiple engines, the default-catalog and default-namespace can't
be used unless both engines use the same catalog name which seems
pretty restrictive to me.
Thanks for the great discussions
Benny
On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Jan, we definitely can store SQL identifiers of multiple
representations in Approach 1.
The takeaway is that SQL identifiers are highly coupled with
engines, just like views. It makes sense to track both
together for consistency.
Thanks,
Walaa.
On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Walaa, thanks you for bringing up this use case. I think
we need to keep in mind that we require identifiers to
interface with the catalog. We cannot use UUIDs.
Which means you also wouldn't be able to use Approach 1
for your use case because you can't store the catalog
names of multiple representations in the lineage. You
would need to fallback to parsing the SQL for a particular
representation and rebuilding the full query tree to
obtain the identifiers.
You could do the same for Approach 2. So I don't see why
Approach 1 would yield any benefits.
Thanks,
Jan
On 07.09.24 00:01, Steven Wu wrote:
Benny, `default-catalog` is optional, while
`default-namespace` is required.
I will retract my comment on the `summary`. it indicates
the engine that made the revision to the current view
version. it doesn't really matter for
multi-engine/representation support.
On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
<btc...@gmail.com> wrote:
Steven - Ideally, the lineage is engine agnostic so
I'd hope it wouldn't have to be under a specific
representation.
Walaa - That's a serious concern... If the same
catalog is aliased differently by two different
engines, then the basic view spec seems broken to me
since "default-namespace" includes the catalog alias
and is outside of the SQL representation. Does that
mean for a view to be interoperable, we require
different engines to use the same catalog name?
Thanks
On Fri, Sep 6, 2024 at 1:29 PM Steven Wu
<stevenz...@gmail.com> wrote:
Walaa, thanks for bringing up the interesting
case of multiple representations (for different
engines), which definitely requires more
discussion from the community.
When I am looking at the view spec, I am seeing
some conflict. "summary" field seems meant for
only one engine, while "representations" support
multiple engines.
"summary" : {
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
: "Spark",
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
: "3.3.2"
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
: [ {
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
: "sql",
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
: "SELECT\n COUNT(1), CAST(event_ts AS
DATE)\nFROM events\nGROUP BY 2",
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
: "spark"
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
]
With multiple representations/engines, I guess
one engine will be responsible for the storage
table refresh and other engines are read only. If
we want to store the lineage info in the view, it
probably needs to be part of the "representation"
struct so that each engine/representation stores
its own lineage info..
Who is to validate/ensure that the SQL
representation is actually semantically identical
(minus syntax differences across engines)? I
guess this responsibility is left to the user who
owns and manages the view.