Hi Walaa,
It appears that you would like to maintain the lineage structure and not
revert to parsing the SQL to obtain identifiers.
Initially, one of the reasons for avoiding SQL parsing was to enable
consumers who don't understand the SQL dialect of any representation to
determine the freshness of the Materialized View (MV). However, with the
"catalog alias" issue, having an identifier for some representation is
insufficient, as the /catalog_name/ is unlikely to work for the
consumer. Therefore, supporting consumers that don't use a query engine
of any representation seems impossible.
Given this, parsing the SQL definition becomes a less significant
drawback, as the consumer must understand the dialect anyway. In fact,
simply parsing the SQL definition seems like a more robust and
straightforward solution than using a lineage for every representation.
I believe this is why Benny suggested reverting to SQL parsing, and I
agree with him.
Regarding the Storage table identifier: Its design as a
/PartialIdentifier/ with only namespace and name fields was intentional,
to avoid the /catalog_name/ issue.
Best regards,
Jan
On 19.09.24 23:16, Benny Chow wrote:
If Spark added the storage table identifier to the MV, I'm not sure
how it could also add a full identifier to the Dremio representation.
Spark doesn't know what name Dremio used for the catalog.
For the UX issue, I think Jan cleverly called it a "PartialIdentifier"
and not a "FullIdentifier" to indicate that catalog name is not even a
property of the identifier.
Requirement 3 is for the view's SQL. I'm not sure there is a very
strong use case to put the storage table into a different catalog than
the view. If we had an engine agnostic solution for it, I'm all for
it though...
Thanks
Benny
On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
I think the solution for the storage identifier might be shared
with the end state solution for the lineage. One could imagine a
"full identifier" can be used for the storage table; however, it
is "representation"-dependent (i.e., it changes according to
which representation it is part of, or rather which engine uses it).
Also, are we asking engines (or their Iceberg implementation) to
throw an exception if the full storage table identifier was
provided as part of the MV definition? Sounds like a not very
ideal UX. Note that it also conflicts with the spirit of
requirement #3.
Thanks,
Walaa.
On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <btc...@gmail.com> wrote:
Hi Jan
"PartialIdentifier" without the catalog name sounds good to
me. The storage table and MV have to be in the same catalog.
That would be a good fifth requirement to add to the list.
Thanks
Benny
On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Cool, I guess it's easier to resolve these kind of things
when talking in person.
I agree with your requirements and the conclusion to use a
map from UUID to snapshot-id/version-id as the
refresh-state, as well as dropping the lineage in favor to
just re-parsing the SQL query. This gets us around the
"catalog alias" issue.
And I'm also OK with every engine requiring their own
representation to use the MV.
There is still the issue with the identifier of the
storage table and its catalog_name. Should we use an
"PartialIdentifier" with a namespace and a name field,
like so:
{
namespace: ["bronze"],
name: "lineitem"
}
And require the storage table to be in the same catalog as
the MV itself?
Thanks,
Jan
On 19.09.24 00:50, Benny Chow wrote:
Steven and I met up yesterday at the Seattle Iceberg
meetup and we got to talking about the "catalog alias"
issue. He described it as an annoying problem =p
I think there are some key requirements we need to support:
1. Different engines can produce and consume shared MVs
with freshness validation.
2. We cannot force different engines to standardize on
the alias they use for the catalog.
3. We cannot force different SQL representations to
exclude catalog names from table identifiers or not use
fully qualified table names.
4. MV SQL can join tables and views from multiple
catalogs -> Inevitable with Nessie, Polaris, Unity,
Tabular and others...
The producing engine has to save refresh state
information to let consuming engine know that table X is
at what snapshot at the time of materialization. The
only way to identify this table across different catalog
names is to use the cross catalog, globally unique UUID.
I think our only option is to have the refresh state map
UUID to snapshot ids and view version ids.
Assuming the above is how we store the refresh state, how
does the consuming engine determine the current snapshot
ids? The consuming engine will have to fully expand the
query tree at which point it will have the UUIDs as well
as the latest snapshot ids/view versions. This can then
be diffed against the materialization refresh state to
determine freshness. There isn't a need to store the
view lineage information to map from UUID to the consumer
specific identifier so that the consumer can then call
back into the catalog with that identifier to get the
latest state. The consuming engine might as well just
re-parse the SQL and expand the query.
Personally, I'm OK with requiring that an engine must
have its own SQL representation in order to use the MV.
To me, being able to fulfill the key requirements above
is much more important.
Thanks
Benny
On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
How about we make the /catalog_name field/ of the
identifier optional? If the field is missing, it
references a table/view in the same catalog. If it is
present it has to be an engine agnostic catalog name.
Shouldn't the catalog_names from the REST catalog
spec be engine agnostic?
I was wondering, is there no way to prescribe a
catalog_name in Spark or Dremio? What do you do if
you include two Nessie catalogs? They can't both be
called LocalNessie.
Thanks,
Jan
On 14.09.24 01:23, Benny Chow wrote:
The main reason for putting the lineage into the
view is so that "another" engine can enumerate out
the tables in the view without needing to parse any
SQL. But, if we put the lineage under the SQL
representation with engine specific catalog names,
the "other" engine is not going to be able to use
those identifiers to look up the tables. The
"other" engine can only lookup those identifiers
using its engine specific catalog name. It may be
possible to enumerate the tables at the view version
level ONLY if those identifiers don't include the
catalog name. However, if you have a view with a
cross catalog join, then the tables coming from the
other catalog have to be fully qualified. But then
the problem is that each engine will also alias the
other catalog differently too.
So, I think to summarize *multi-engine* view
interoperability:
* default-catalog can't be specified
* default-namespace can be specified
* View SQL can only references tables/views from
the same catalog
I think these are reasonable constraints for
multi-engine use cases. If reasonable, for MVs,
then the storage table, refresh-state and lineage
(at the view version level), could all be based on
*engine agnostic* identifiers without the catalog
name. The MV and storage table would have to be in
the same catalog.
Thanks
Benny
On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
Hi,
regarding our recent discussion on table
identifiers with respect to different
catalog_names with different query engines. We
have the same problem when we want to reference
the storage table from the common view. *If we
include the catalog_name as part of the
identifier, different query engines might not be
able to load the storage table.
*
We could enforce that every storage table has to
be part of the same catalog as the main view.
This way an identifier without the catalog_name
would be enough to point to the correct storage
table.
What are your thoughts on this?
Best wishes,
Jan
On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
I think this type of discussion is exactly what
motivates a clarification in the view spec so
that we can resolve MV lineage. Will create
separate thread for view spec clarification.
Following up on Jan’s point, yes I agree in
order to support catalog name, it should be at
the representation level, but catalog name does
not really depend on the “dialect” but rather
on the “engine”; hence the discussion becomes a
little more involved.
Thanks,
Walaa.
On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
Hi Benny,
I think that identifiers only being defined
for a certain representation is exactly
what we want. Each representation can
define their own identifiers that then map
to an UUID. This way the "catalog_name" of
the identifier for a "Spark" dialect can be
different then for a "Dremio" dialect.
The important part is that we still have a
list of identifiers for each representation
that we can use with the catalog to obtain
the state of the source tables.
Best wishes,
Jan
On 11.09.24 01:33, Benny Chow wrote:
Hi Walaa, I don't think the current view
spec implicitly assumes a common catalog
name between engines. I tested this by
not specifying the default-catalog and
both engines could look up the correct
table under the shared default-namespace
even when each engine uses a different
catalog name.
Hi Jan, I think the issue with putting the
lineage as part of the representation is
that that identifier only makes sense for
that representation's engine. In your
example, the catalog aliased as "iceberg"
in spark is going to have a different name
in Dremio or Trino.
IMO, if we are to store a lineage for a
view, it should consist of something
engine agnostic like the table/view UUIDs.
This would be stored at the view version
level and not the representation level. I
think as we get into more of these
multi-engine, multi-catalog use cases for
views, the Iceberg Catalog is going to
need to do a better job at handling CRUD
by UUID instead of engine specific
identifiers. Another scenario we need to
think through is a view that joins tables
from two different catalogs. How would we
represent the lineage for that in an
engine agnostic way?
Thanks
Benny
On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid> wrote:
Thanks Walaa and Benny for clarifying
the problem. I think I have a better
understanding now. Sorry for being a
bit stubborn before.
Wouldn't it make sense then to store
the lineage as part of the representation:
{
"type": "sql",
"sql": "SELECT\n COUNT(1),
CAST(event_ts AS DATE)\nFROM
events\nGROUP BY 2",
"dialect": "spark",
"lineage": [{
"identifier": { "catalog": "iceberg",
"namespace": "public", "table":
"events"},
"uuid":
"fa6506c3-7681-40c8-86dc-e36561f83385"
}]
}
Best wishes,
Jan
On 09.09.24 11:59, Walaa Eldin
Moustafa wrote:
Benny, thank you so much for
performing the experiment. Glad that
using UUIDs as keys in the state map
makes more sense now.
For the issue with the view spec
being restrictive, I agree and I have
raised the concern on the view spec
PR last year [1]. I think there is
some area of improvement here. At the
least, if it is restrictive, it
should be explicitly stated. I will
start a thread on how to approach the
view spec. We may need to get more
insight on the view spec before
finalizing the MV spec, because view
spec will determine if we should
proceed with one lineage (with the
implicitly assumed common catalog
name), or with multiple lineages (one
per engine or catalog name).
[1]
https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
Thanks,
Walaa.
On Mon, Sep 9, 2024 at 3:28 AM Benny
Chow <btc...@gmail.com> wrote:
Hi Walaa
I did some testing with two
different engines (Spark and
Dremio) against the same Nessie
catalog and created the attached
materialized view metadata.json.
I see your point now about the
SQL identifiers being tightly
coupled to the engines. In the
metadata JSON, spark refers to
the catalog as "SparkNessie",
whereas Dremio refers to the
catalog as "LocalNessie". So,
this means that the fully
qualified view and table
identifiers are engine specific
and Dremio can't lookup a Spark
identifier and vice versa.
*So, I think it does make sense
now for the refresh-state to key
off the UUIDs and not use engine
specific identifiers. *This also
means that the materization
consumer will have to fully
expand the query tree and
basically diff the UUID + latest
snapshot ids against the refresh
state. Would it ever make sense
for the Iceberg Catalog to expose
a bulk lookup API by UUID?
As a side note, it seems that for
a materialized view to work with
multiple engines, the
default-catalog and
default-namespace can't be used
unless both engines use the same
catalog name which seems pretty
restrictive to me.
Thanks for the great discussions
Benny
On Sat, Sep 7, 2024 at 2:49 AM
Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Jan, we definitely can store
SQL identifiers of multiple
representations in Approach 1.
The takeaway is that SQL
identifiers are highly
coupled with engines, just
like views. It makes sense to
track both together for
consistency.
Thanks,
Walaa.
On Sat, Sep 7, 2024 at
8:15 AM Jan Kaul
<jank...@mailbox.org.invalid>
<mailto:jank...@mailbox.org.invalid>
wrote:
Walaa, thanks you for
bringing up this use
case. I think we need to
keep in mind that we
require identifiers to
interface with the
catalog. We cannot use
UUIDs.
Which means you also
wouldn't be able to use
Approach 1 for your use
case because you can't
store the catalog names
of multiple
representations in the
lineage. You would need
to fallback to parsing
the SQL for a particular
representation and
rebuilding the full query
tree to obtain the
identifiers.
You could do the same for
Approach 2. So I don't
see why Approach 1 would
yield any benefits.
Thanks,
Jan
On 07.09.24 00:01, Steven
Wu wrote:
Benny, `default-catalog`
is optional, while
`default-namespace` is
required.
I will retract my
comment on the
`summary`. it indicates
the engine that made the
revision to the current
view version. it doesn't
really matter for
multi-engine/representation
support.
On Fri, Sep 6, 2024 at
2:49 PM Benny Chow
<btc...@gmail.com> wrote:
Steven - Ideally,
the lineage is
engine agnostic so
I'd hope it wouldn't
have to be under a
specific representation.
Walaa - That's a
serious concern...
If the same catalog
is aliased
differently by two
different engines,
then the basic view
spec seems broken to
me since
"default-namespace"
includes the catalog
alias and is outside
of the SQL
representation. Does
that mean for a view
to be interoperable,
we require different
engines to use the
same catalog name?
Thanks
On Fri, Sep 6, 2024
at 1:29 PM Steven Wu
<stevenz...@gmail.com>
wrote:
Walaa, thanks
for bringing up
the interesting
case of multiple
representations
(for different
engines), which
definitely
requires more
discussion from
the community.
When I am
looking at the
view spec, I am
seeing some
conflict.
"summary" field
seems meant for
only one engine,
while
"representations"
support multiple
engines.
"summary" : {
<https://iceberg.apache.org/view-spec/#__codelineno-5-16>"engine-name"
: "Spark",
<https://iceberg.apache.org/view-spec/#__codelineno-5-17>"engineVersion"
: "3.3.2"
<https://iceberg.apache.org/view-spec/#__codelineno-5-18>},
<https://iceberg.apache.org/view-spec/#__codelineno-5-19>"representations"
: [ {
<https://iceberg.apache.org/view-spec/#__codelineno-5-20>"type"
: "sql",
<https://iceberg.apache.org/view-spec/#__codelineno-5-21>"sql"
: "SELECT\n
COUNT(1),
CAST(event_ts AS
DATE)\nFROM
events\nGROUP BY
2",
<https://iceberg.apache.org/view-spec/#__codelineno-5-22>"dialect"
: "spark"
<https://iceberg.apache.org/view-spec/#__codelineno-5-23>}
]
With multiple
representations/engines,
I guess one
engine will be
responsible for
the storage
table refresh
and other
engines are read
only. If we want
to store the
lineage info in
the view, it
probably
needs to be part
of the
"representation"
struct so that
each
engine/representation
stores its own
lineage info..
Who is to
validate/ensure
that the SQL
representation
is actually
semantically
identical (minus
syntax
differences
across engines)?
I guess this
responsibility
is left to the
user who owns
and manages the
view.