I'm also supportive of Jan's PR + adding the catalog, namespace and table
(all optional) for readability in the refresh state.
Let's keep the big picture in mind that we have achieved a spec that
enables engine A to produce a MV and engine B to consume it. This can
significantly reduce the cost t
Thanks Steven. I feel we have not completely agreed on the PR yet. I also
think it will be a dependency to move forward here. Let me address the
comments and get further feedback.
On Sun, Nov 10, 2024 at 10:46 AM Steven Wu wrote:
> I also feel good about moving forward with Jan's latest descrip
Let me recap and see if we are on the same page.
1) we have some consensus on the refresh-state on the storage table. it
would contain these fields: UUID, snapshot-id (for table) or version-id
(for view), namespace, table.
2) there is no consensus if lineage info is needed in the view definition.
w
Hi Jan
Both sound good to me. (No lineage in views and assumption about UUIDs
being unique across catalogs). I hope we get to voting soon on your PR..
Thanks
Benny
On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul
wrote:
> Hi Benny,
>
> thanks for bringing up the UUID issue. It is my understanding t
>> storing the lineage is an optimization that can avoid
recomputation/re-parsing.
I don't think having the lineage is optimizing much over re-parsing the
SQL. The most expensive part of SQL parsing is catalog access which has to
happen with lineage anyway. Once the planner has the query tree, i
I agree that it is reasonable to assume/restrict view definition and
storage table in the same catalog. Hence the storage table reference in the
view metadata can include only namespace and table (excluding the engine
dependent catalog name/alias).
Regarding the question of having lineage metadata
Hi Walaa,
It appears that you would like to maintain the lineage structure and not
revert to parsing the SQL to obtain identifiers.
Initially, one of the reasons for avoiding SQL parsing was to enable
consumers who don't understand the SQL dialect of any representation to
determine the fresh
If Spark added the storage table identifier to the MV, I'm not sure how it
could also add a full identifier to the Dremio representation.
Spark doesn't know what name Dremio used for the catalog.
For the UX issue, I think Jan cleverly called it a "PartialIdentifier" and
not a "FullIdentifier" to i
I think the solution for the storage identifier might be shared with the
end state solution for the lineage. One could imagine a "full identifier"
can be used for the storage table; however, it is
"representation"-dependent (i.e., it changes according to
which representation it is part of, or rathe
Hi Jan
"PartialIdentifier" without the catalog name sounds good to me. The
storage table and MV have to be in the same catalog. That would be a good
fifth requirement to add to the list.
Thanks
Benny
On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul
wrote:
> Cool, I guess it's easier to resolve these
Cool, I guess it's easier to resolve these kind of things when talking
in person.
I agree with your requirements and the conclusion to use a map from UUID
to snapshot-id/version-id as the refresh-state, as well as dropping the
lineage in favor to just re-parsing the SQL query. This gets us aro
Steven and I met up yesterday at the Seattle Iceberg meetup and we got to
talking about the "catalog alias" issue. He described it as an annoying
problem =p
I think there are some key requirements we need to support:
1. Different engines can produce and consume shared MVs with freshness
validati
How about we make the /catalog_name field/ of the identifier optional?
If the field is missing, it references a table/view in the same catalog.
If it is present it has to be an engine agnostic catalog name. Shouldn't
the catalog_names from the REST catalog spec be engine agnostic?
I was wonder
The main reason for putting the lineage into the view is so that "another"
engine can enumerate out the tables in the view without needing to parse
any SQL. But, if we put the lineage under the SQL representation with
engine specific catalog names, the "other" engine is not going to be able
to use
Hi,
regarding our recent discussion on table identifiers with respect to
different catalog_names with different query engines. We have the same
problem when we want to reference the storage table from the common
view. *If we include the catalog_name as part of the identifier,
different query
I think this type of discussion is exactly what motivates a clarification
in the view spec so that we can resolve MV lineage. Will create separate
thread for view spec clarification.
Following up on Jan’s point, yes I agree in order to support catalog name,
it should be at the representation level
Hi Benny,
I think that identifiers only being defined for a certain representation
is exactly what we want. Each representation can define their own
identifiers that then map to an UUID. This way the "catalog_name" of the
identifier for a "Spark" dialect can be different then for a "Dremio"
d
Hi Walaa, I don't think the current view spec implicitly assumes a common
catalog name between engines. I tested this by not specifying the
default-catalog and both engines could look up the correct table under the
shared default-namespace even when each engine uses a different catalog
name.
Hi J
Thanks Walaa and Benny for clarifying the problem. I think I have a
better understanding now. Sorry for being a bit stubborn before.
Wouldn't it make sense then to store the lineage as part of the
representation:
{
"type": "sql",
"sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFR
Hi Walaa
I did some testing with two different engines (Spark and Dremio) against
the same Nessie catalog and created the attached materialized view
metadata.json. I see your point now about the SQL identifiers being
tightly coupled to the engines. In the metadata JSON, spark refers to the
catal
Hi Steven
Yes, I definitely think #2 is easier and cleaner for both reader and writer
and that lineage is a separate feature all together. There's no need to
couple materialization state with view lineage.
The other way to look at helping to decide between the two options is
what is the most per
Hi Dan, Thanks for the feedback.
For Drop/Recreate, as long as we verify the UUID is still current (i.e.,
the table in the catalog still maintains the same UUID as in the one in the
lineage), we can assume the table is not dropped and recreated. Once it is
dropped or recreated, the UUID in the cat
Hi all,I would like to mention that Walaa's document doesn't mention any downsides of Approach #1 in the comparison.Approach #1 has downsides when views are referenced in the materialized view query. These problems arise because only direct children are stored in the lineage. This leads to:- The li
I'm generally in favor of approach #1 with UUID in lineage.
I think it's helpful to know if the underlying table changes (e.g.
identifier remains the same, but the table was changed). However, I'm not
sure what the behavior would be in that case. Any refresh at that point
would not be able to pr
Hi Steven,
Thanks! I see the PR now introduces UUID in the state information. That is
great progress. I still have slight preference on where to place UUID (in
lineage vs state), which I summarized in this doc [1] as promised. I wrote
the doc before UUID was added in the PR, so it still compares w
Walaa, for the listed discussion points, how should we move forward? should
we have another MV sync meeting?
BTW, Jan's latest spec PR addressed my comment on UUID.
On Mon, Sep 2, 2024 at 4:35 PM Walaa Eldin Moustafa
wrote:
> Hi Jan,
>
> I think we need further discussion for a few reasons:
>
>
Hi Jan,
I think we need further discussion for a few reasons:
* Semantic issues are significant, especially in a spec. They could create
gaps that are very hard to fix in the future. I would rather spend more
time designing them properly than fix the gaps with band-aid solutions in
the future.
*
Hi Walaa,
in my opinion the UUID vs table identifier discussion comes down to the
question: do we introduce table identifiers in an opaque struct in the
table snapshot summary or do we accept technical drawbacks?
Using UUIDs with a lineage struct in the view metadata leads to one of
the foll
Hi Jan,
I think we need to close the discussion on the UUID vs table identifier
options and possibly cast a vote before having a productive discussion on
the PR. I did not get a chance yet to post the document on the UUID vs
table identifier discussion. I will do that by next week.
Thanks,
Walaa.
Hi all,
to move the Iceberg Materialzied View Proposal forward, I created a PR
(https://github.com/apache/iceberg/pull/11041) that adds a section on
Materialized Views to the View Spec. I hope we can resolve any remaining
questions there, before we can start the voting process for the Proposal
30 matches
Mail list logo