Hi Jan, I do not think this is feasible because it assumes the catalog
identifiers do not collide across catalogs. Anyway, let us not
over-engineer this use case. As I mentioned, it was for illustration purposes.
Since the discussion moved from “UUIDs vs Sequence numbers” to “UUIDs vs
catalog tab
Theoretically, we could have multiple catalogs each with different table
name entries but referring to the same Iceberg table metadata, and hence
same UUIDs (view metadata cannot be shared since they are strongly bound to
the catalog identifiers). I understand this is not an everyday scenario but
i
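A minimal sketch of the scenario described above, where two catalogs register the same Iceberg table metadata under different names. The catalog dictionaries, the table names, and the UUID value are hypothetical and only illustrate why a UUID-keyed state stays the same across catalogs while an identifier-keyed state does not.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TableMetadata:
    """Stand-in for Iceberg table metadata (hypothetical, heavily simplified)."""
    uuid: str
    current_snapshot_id: int


# One shared metadata object, registered under different names in two catalogs.
shared = TableMetadata(uuid="00000000-0000-0000-0000-000000000001",
                       current_snapshot_id=42)
catalog_a: Dict[str, TableMetadata] = {"db.orders": shared}
catalog_b: Dict[str, TableMetadata] = {"sales.orders_v2": shared}


def state_by_identifier(catalog: Dict[str, TableMetadata]) -> Dict[str, int]:
    """Refresh state keyed by catalog identifier."""
    return {name: md.current_snapshot_id for name, md in catalog.items()}


def state_by_uuid(catalog: Dict[str, TableMetadata]) -> Dict[str, int]:
    """Refresh state keyed by table UUID."""
    return {md.uuid: md.current_snapshot_id for md in catalog.values()}
```

With these assumptions, the UUID-keyed state is identical for both catalogs, while the identifier-keyed state differs by catalog.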
Hi Walaa, I personally don't see a semantic issue with putting the table
identifiers in the refresh state. The purpose of the refresh state is to
basically take a snapshot of the table and view versions at the time of
materialization. Directly using table identifiers seems pretty natural to
me.
Hi Micah, it is mostly about the typical results of denormalization such as
data consistency, management complexity, integrity, etc. However, as
mentioned earlier, the main reason would be the semantic gap around using
catalog table identifiers as a concept in the table (more specifically
snapshot
>
> Thanks Micah, for the latter, I meant the type of denormalization of
> repeating a 3-part name as opposed to using an ID.
Is the concern here just metadata size or something else? For size I think
if this is really anticipated to be a problem that it is likely for the
state map in general, a
Thanks Micah, for the latter, I meant the type of denormalization of
repeating a 3-part name as opposed to using an ID.
On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield
wrote:
> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table s
>
> However, this still does not address the semantic issue which is more
> fundamental in my opinion. The Iceberg table spec is not aware of catalog
> table identifiers and this use will be the first break of this abstraction.
IIUC, based on Jan's comments, we are not going to modify the table
s
That is right. I agree that in the case of using catalog identifiers in
state information, using them in lineage information would be a
nice-to-have and not a requirement.
However, this still does not address the semantic issue which is more
fundamental in my opinion. The Iceberg table spec is not
Hi Walaa, I would argue that for the refresh operation the query engine has to parse the query and then somehow execute it. For a full refresh it will directly execute the query, and for an incremental refresh it will execute a modified version. Therefore it has to fully expand the query tree. Best wis
Thanks Jan for the summary.
For this point:
> For a refresh operation the query engine has to parse the SQL and fully
expand the lineage with its children anyway. So the lineage is not
strictly required.
If the lineage is provided at creation time by the respective engine, the
refresh operatio
As the table I created is not properly shown in the mailing list I'll
reformat the summary of the different drawbacks again:
Drawbacks of (no lineage, refresh-state key = identifier):
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time
Hi,
Thanks Micah for clearly stating the requirements. I think this gives
better clarity for the discussion.
It seems like we don't have a solution that satisfies all requirements
at once. So we would need to choose which has the fewest drawbacks.
I would like to summarize the different dra
The option of using catalog identifiers in the state map still requires
keeping lineage information in the view because REFRESH MV needs the latest
fully expanded children (which could have changed from the set of children
currently in the state map), without reparsing the view tree. Therefore,
cat
>
> I think given the constraint that catalog lookup has to be by identifier
> and not UUID, I'd prefer using identifier in the refresh state. If we use
> identifiers, we can directly parallelize the catalog calls to fetch the
> latest state. If we use UUID, the engine has to go back to the MV an
>
> I do not think 3 and 4 are at odds with each other (for example
> maintaining both lineage map and state map through UUID can achieve both).
I agree, I should have been more clear that #5 (limiting new view versions)
also comes into play. If UUID is used in lineage as part of the view spec,
I think given the constraint that catalog lookup has to be by identifier
and not UUID, I'd prefer using identifier in the refresh state. If we use
identifiers, we can directly parallelize the catalog calls to fetch the
latest state. If we use UUID, the engine has to go back to the MV and
possibly
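The parallel lookup argued for above could be sketched as follows, assuming the refresh state is keyed by catalog identifier. `load_snapshot_id` is a hypothetical callable standing in for a catalog call; it is not part of any Iceberg API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_latest_state(
    identifiers: Iterable[str],
    load_snapshot_id: Callable[[str], Optional[int]],
    max_workers: int = 8,
) -> Dict[str, Optional[int]]:
    """Fetch the current snapshot ID for every identifier in parallel.

    load_snapshot_id is a hypothetical catalog lookup that returns None
    when the identifier cannot be resolved.
    """
    ids = list(identifiers)
    if not ids:
        return {}
    # Every identifier is known up front, so all lookups can be issued at once.
    with ThreadPoolExecutor(max_workers=min(max_workers, len(ids))) as pool:
        results = list(pool.map(load_snapshot_id, ids))  # map preserves input order
    return dict(zip(ids, results))
```

With UUID keys, the engine would first need a UUID-to-identifier mapping before these calls could be issued, which is the extra step the message refers to.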
Thanks Jan, Micah, and Karuppayya for chiming in.
I do not think 3 and 4 are at odds with each other (for example
maintaining both lineage map and state map through UUID can achieve both).
Also, I do not think we can drop the lineage map since in many catalogs,
the only lookup method is by the cat
+1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
deferring the inclusion of lineage to a future iteration (like Micah
mentioned).
This would greatly simplify the current design.
Also in terms of identifiers to use (UUID or catalog identifier) for the
refresh state
We will not b
I think it might be worth restating perceived requirements and making sure
there is alignment on them.
If I am reading correctly, I think the following are perceived requirements:
1. An engine must be able to unambiguously detect whether an underlying
queried entity has changed via metadata to
If we go with either UUID or Table Identifier + VersionID/SnapshotId in the
refresh state, then this list is fully expanded already. So, to validate
the freshness of a materialization, the engine doesn't even need to look at
the view lineage. IMO, the view lineage is nice to have but not a
necess
Hi all,
I would like to reemphasize the purpose of the refresh-state for
materialized views. The purpose is to determine if the precomputed data
is fresh, stale or invalid. For that the current snapshot-id of every
table in the query tree has to be fetched from the catalog by using its
full i
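A minimal sketch of the freshness check described above, assuming the refresh state maps each source table identifier to the snapshot ID recorded at materialization time. `current_snapshot_id` is a hypothetical catalog lookup, and treating an unresolvable table as "invalid" is an assumption, since the thread does not define that state.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"
    INVALID = "invalid"


def check_freshness(
    refresh_state: Dict[str, int],
    current_snapshot_id: Callable[[str], Optional[int]],
) -> Freshness:
    """Compare recorded snapshot IDs against the catalog's current ones.

    current_snapshot_id is a hypothetical lookup returning None when the
    table cannot be resolved (assumed here to mean INVALID).
    """
    result = Freshness.FRESH
    for identifier, recorded in refresh_state.items():
        current = current_snapshot_id(identifier)
        if current is None:
            return Freshness.INVALID  # a source table is gone
        if current != recorded:
            result = Freshness.STALE  # keep scanning for invalid tables
    return result
```

The loop keeps scanning after finding a stale table so that a missing table later in the map is still reported as invalid.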
Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref,
although we can iterate on the exact JSON structure (e.g., another option is
splitting (UUID) state from (UUID + ref) state into two separate
higher-level fields).
Generally agree on REFRESH VIEW strategy could be up to the
I'd like to hear Jan's feedback on using UUID and normalizing the view
lineage. I'm on board with this change.
I updated the fully spec'd out example using UUID and a normalized view
lineage:
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.o6yn2lnpxo
Hi Everyone,
Just a follow up on this thread. Thanks Benny and Micah for the discussion
on the doc [1]. We have been converging more on using UUIDs from the
discussion. The only open question was related to UUIDs (of underlying
views/tables) being stale upon a REPLACE (or DROP and CREATE) operatio
Thanks Benny! We discussed this option during the meeting but we did not
prefer it because we did not want to leak the SQL identifiers to the
storage table since SQL identifiers are view concepts and fit better with
the view.
Thanks,
Walaa.
On Thu, Aug 8, 2024 at 4:12 PM Benny Chow wrote:
> May
Maybe a third option is to decouple the view lineage and materialization
state.
The view lineage can just list out the SQL identifiers+ref... we can still
decide whether this is just direct children or fully expanded.
The materialization state doesn't have to depend on the view lineage
(through ei
Hi Everyone,
In the last community sync on Materialized Views [1], we agreed to split
the information that is used to determine the materialized view staleness
into two parts: Lineage Information and State Information. We have made a lot
of progress on representing both but one issue remains open: