Benny, thank you so much for performing the experiment. Glad
that using UUIDs as keys in the state map makes more sense now.
For the issue with the view spec being restrictive, I agree
and I have raised the concern on the view spec PR last year
[1]. I think there is some area of improvement here. At the
least, if it is restrictive, it should be explicitly stated.
I will start a thread on how to approach the view spec. We
may need to get more insight on the view spec before
finalizing the MV spec, because view spec will determine if
we should proceed with one lineage (with the implicitly
assumed common catalog name), or with multiple lineages (one
per engine or catalog name).
[1]
https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
Thanks,
Walaa.
On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com>
wrote:
Hi Walaa
I did some testing with two different engines (Spark and
Dremio) against the same Nessie catalog and created the
attached materialized view metadata.json. I see your
point now about the SQL identifiers being tightly
coupled to the engines. In the metadata JSON, spark
refers to the catalog as "SparkNessie", whereas Dremio
refers to the catalog as "LocalNessie". So, this means
that the fully qualified view and table identifiers are
engine specific and Dremio can't lookup a Spark
identifier and vice versa.
*So, I think it does make sense now for the
refresh-state to key off the UUIDs and not use engine
specific identifiers. *This also means that the
materization consumer will have to fully expand the
query tree and basically diff the UUID + latest snapshot
ids against the refresh state. Would it ever make sense
for the Iceberg Catalog to expose a bulk lookup API by UUID?
As a side note, it seems that for a materialized view to
work with multiple engines, the default-catalog and
default-namespace can't be used unless both engines use
the same catalog name which seems pretty restrictive to me.
Thanks for the great discussions
Benny
On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Jan, we definitely can store SQL identifiers of
multiple representations in Approach 1.
The takeaway is that SQL identifiers are highly
coupled with engines, just like views. It makes
sense to track both together for consistency.
Thanks,
Walaa.
On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
Walaa, thanks you for bringing up this use case.
I think we need to keep in mind that we require
identifiers to interface with the catalog. We
cannot use UUIDs.
Which means you also wouldn't be able to use
Approach 1 for your use case because you can't
store the catalog names of multiple
representations in the lineage. You would need
to fallback to parsing the SQL for a particular
representation and rebuilding the full query
tree to obtain the identifiers.
You could do the same for Approach 2. So I don't
see why Approach 1 would yield any benefits.
Thanks,
Jan
On 07.09.24 00:01, Steven Wu wrote:
Benny, `default-catalog` is optional, while
`default-namespace` is required.
I will retract my comment on the `summary`. it
indicates the engine that made the revision to
the current view version. it doesn't really
matter for multi-engine/representation support.
On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
<btc...@gmail.com> wrote:
Steven - Ideally, the lineage is engine
agnostic so I'd hope it wouldn't have to be
under a specific representation.
Walaa - That's a serious concern... If the
same catalog is aliased differently by two
different engines, then the basic view spec
seems broken to me since
"default-namespace" includes the catalog
alias and is outside of the SQL
representation. Does that mean for a view
to be interoperable, we require different
engines to use the same catalog name?
Thanks
On Fri, Sep 6, 2024 at 1:29 PM Steven Wu
<stevenz...@gmail.com> wrote:
Walaa, thanks for bringing up the
interesting case of multiple
representations (for different
engines), which definitely requires
more discussion from the community.
When I am looking at the view spec, I
am seeing some conflict. "summary"
field seems meant for only one engine,
while "representations" support
multiple engines.
"summary" : {
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
: "Spark",
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
: "3.3.2"
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
: [ {
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
: "sql",
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
: "SELECT\n COUNT(1), CAST(event_ts AS
DATE)\nFROM events\nGROUP BY 2",
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
: "spark"
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
]
With multiple representations/engines,
I guess one engine will be responsible
for the storage table refresh and other
engines are read only. If we want to
store the lineage info in the view, it
probably needs to be part of the
"representation" struct so that each
engine/representation stores its own
lineage info..
Who is to validate/ensure that the SQL
representation is actually semantically
identical (minus syntax differences
across engines)? I guess this
responsibility is left to the user who
owns and manages the view.