Benny, thank you so much for performing the experiment.
Glad that using UUIDs as keys in the state map makes
more sense now.
For the issue with the view spec being restrictive, I
agree and I have raised the concern on the view spec PR
last year [1]. I think there is some area of
improvement here. At the least, if it is restrictive,
it should be explicitly stated. I will start a thread
on how to approach the view spec. We may need to get
more insight on the view spec before finalizing the MV
spec, because view spec will determine if we should
proceed with one lineage (with the implicitly assumed
common catalog name), or with multiple lineages (one
per engine or catalog name).
[1]
https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
Thanks,
Walaa.
On Mon, Sep 9, 2024 at 3:28 AM Benny Chow
<btc...@gmail.com> wrote:
Hi Walaa
I did some testing with two different engines
(Spark and Dremio) against the same Nessie catalog
and created the attached materialized view
metadata.json. I see your point now about the SQL
identifiers being tightly coupled to the engines.
In the metadata JSON, spark refers to the catalog
as "SparkNessie", whereas Dremio refers to the
catalog as "LocalNessie". So, this means that the
fully qualified view and table identifiers are
engine specific and Dremio can't lookup a Spark
identifier and vice versa.
*So, I think it does make sense now for the
refresh-state to key off the UUIDs and not use
engine specific identifiers. *This also means that
the materization consumer will have to fully expand
the query tree and basically diff the UUID + latest
snapshot ids against the refresh state. Would it
ever make sense for the Iceberg Catalog to expose a
bulk lookup API by UUID?
As a side note, it seems that for a materialized
view to work with multiple engines, the
default-catalog and default-namespace can't be used
unless both engines use the same catalog name which
seems pretty restrictive to me.
Thanks for the great discussions
Benny
On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Jan, we definitely can store SQL identifiers of
multiple representations in Approach 1.
The takeaway is that SQL identifiers are highly
coupled with engines, just like views. It makes
sense to track both together for consistency.
Thanks,
Walaa.
On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
Walaa, thanks you for bringing up this use
case. I think we need to keep in mind that
we require identifiers to interface with
the catalog. We cannot use UUIDs.
Which means you also wouldn't be able to
use Approach 1 for your use case because
you can't store the catalog names of
multiple representations in the lineage.
You would need to fallback to parsing the
SQL for a particular representation and
rebuilding the full query tree to obtain
the identifiers.
You could do the same for Approach 2. So I
don't see why Approach 1 would yield any
benefits.
Thanks,
Jan
On 07.09.24 00:01, Steven Wu wrote:
Benny, `default-catalog` is optional,
while `default-namespace` is required.
I will retract my comment on the
`summary`. it indicates the engine that
made the revision to the current view
version. it doesn't really matter for
multi-engine/representation support.
On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
<btc...@gmail.com> wrote:
Steven - Ideally, the lineage is
engine agnostic so I'd hope it
wouldn't have to be under a specific
representation.
Walaa - That's a serious concern...
If the same catalog is aliased
differently by two different engines,
then the basic view spec seems broken
to me since "default-namespace"
includes the catalog alias and is
outside of the SQL representation.
Does that mean for a view to be
interoperable, we require different
engines to use the same catalog name?
Thanks
On Fri, Sep 6, 2024 at 1:29 PM Steven
Wu <stevenz...@gmail.com> wrote:
Walaa, thanks for bringing up the
interesting case of multiple
representations (for different
engines), which definitely
requires more discussion from the
community.
When I am looking at the view
spec, I am seeing some conflict.
"summary" field seems meant for
only one engine, while
"representations" support multiple
engines.
"summary" : {
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
: "Spark",
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
: "3.3.2"
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
: [ {
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
: "sql",
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
: "SELECT\n COUNT(1),
CAST(event_ts AS DATE)\nFROM
events\nGROUP BY 2",
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
: "spark"
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
]
With multiple
representations/engines, I guess
one engine will be responsible for
the storage table refresh and
other engines are read only. If we
want to store the lineage info in
the view, it probably needs to be
part of the "representation"
struct so that each
engine/representation stores its
own lineage info..
Who is to validate/ensure that the
SQL representation is actually
semantically identical (minus
syntax differences across
engines)? I guess this
responsibility is left to the user
who owns and manages the view.