Benny, thank you so much for performing the experiment. Glad that
using UUIDs as keys in the state map makes more sense now.
For the issue with the view spec being restrictive, I agree and I
have raised the concern on the view spec PR last year [1]. I
think there is some area of improvement here. At the least, if it
is restrictive, it should be explicitly stated. I will start a
thread on how to approach the view spec. We may need to get more
insight on the view spec before finalizing the MV spec, because
view spec will determine if we should proceed with one lineage
(with the implicitly assumed common catalog name), or with
multiple lineages (one per engine or catalog name).
[1]
https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
Thanks,
Walaa.
On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote:
Hi Walaa
I did some testing with two different engines (Spark and
Dremio) against the same Nessie catalog and created the
attached materialized view metadata.json. I see your point
now about the SQL identifiers being tightly coupled to the
engines. In the metadata JSON, spark refers to the catalog as
"SparkNessie", whereas Dremio refers to the catalog as
"LocalNessie". So, this means that the fully qualified view
and table identifiers are engine specific and Dremio can't
lookup a Spark identifier and vice versa.
*So, I think it does make sense now for the refresh-state to
key off the UUIDs and not use engine specific identifiers.
*This also means that the materization consumer will have to
fully expand the query tree and basically diff the UUID +
latest snapshot ids against the refresh state. Would it ever
make sense for the Iceberg Catalog to expose a bulk lookup
API by UUID?
As a side note, it seems that for a materialized view to work
with multiple engines, the default-catalog and
default-namespace can't be used unless both engines use the
same catalog name which seems pretty restrictive to me.
Thanks for the great discussions
Benny
On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Jan, we definitely can store SQL identifiers of multiple
representations in Approach 1.
The takeaway is that SQL identifiers are highly coupled
with engines, just like views. It makes sense to track
both together for consistency.
Thanks,
Walaa.
On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
Walaa, thanks you for bringing up this use case. I
think we need to keep in mind that we require
identifiers to interface with the catalog. We cannot
use UUIDs.
Which means you also wouldn't be able to use Approach
1 for your use case because you can't store the
catalog names of multiple representations in the
lineage. You would need to fallback to parsing the
SQL for a particular representation and rebuilding
the full query tree to obtain the identifiers.
You could do the same for Approach 2. So I don't see
why Approach 1 would yield any benefits.
Thanks,
Jan
On 07.09.24 00:01, Steven Wu wrote:
Benny, `default-catalog` is optional, while
`default-namespace` is required.
I will retract my comment on the `summary`. it
indicates the engine that made the revision to the
current view version. it doesn't really matter for
multi-engine/representation support.
On Fri, Sep 6, 2024 at 2:49 PM Benny Chow
<btc...@gmail.com> wrote:
Steven - Ideally, the lineage is engine agnostic
so I'd hope it wouldn't have to be under a
specific representation.
Walaa - That's a serious concern... If the same
catalog is aliased differently by two different
engines, then the basic view spec seems broken
to me since "default-namespace" includes the
catalog alias and is outside of the SQL
representation. Does that mean for a view to be
interoperable, we require different engines to
use the same catalog name?
Thanks
On Fri, Sep 6, 2024 at 1:29 PM Steven Wu
<stevenz...@gmail.com> wrote:
Walaa, thanks for bringing up the
interesting case of multiple representations
(for different engines), which definitely
requires more discussion from the community.
When I am looking at the view spec, I am
seeing some conflict. "summary" field seems
meant for only one engine, while
"representations" support multiple engines.
"summary" : {
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
: "Spark",
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
: "3.3.2"
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
: [ {
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
: "sql",
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
: "SELECT\n COUNT(1), CAST(event_ts AS
DATE)\nFROM events\nGROUP BY 2",
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
: "spark"
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
]
With multiple representations/engines, I
guess one engine will be responsible for the
storage table refresh and other engines are
read only. If we want to store the lineage
info in the view, it probably needs to be
part of the "representation" struct so that
each engine/representation stores its own
lineage info..
Who is to validate/ensure that the SQL
representation is actually semantically
identical (minus syntax differences across
engines)? I guess this responsibility is
left to the user who owns and manages the view.