Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-21 Thread Walaa Eldin Moustafa
Hi Jan, I do not think this is feasible because it assumes the catalog identifiers do not collide across catalogs. Anyways, let us not over engineer this use case. As I mentioned, it was for illustration purposes. Since the discussion moved from “UUIDs vs Sequence numbers” to “UUIDs vs catalog tab

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-20 Thread Walaa Eldin Moustafa
Theoretically, we could have multiple catalogs each with different table name entries but referring to the same Iceberg table metadata, and hence same UUIDs (view metadata cannot be shared since they are strongly bound to the catalog identifiers). I understand this is not an everyday scenario but i

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Benny Chow
Hi Walaa, I personally don't see a semantic issue with putting the table identifiers in the refresh state. The purpose of the refresh state is to basically take a snapshot of the table and view versions at the time of materialization. Directly using table identifiers seems pretty natural to me.

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Walaa Eldin Moustafa
Hi Micah, it is mostly about the typical results of denormalization such as data consistency, management complexity, integrity, etc. However, as mentioned earlier, the main reason would be the semantic gap around using catalog table identifiers as a concept in the table (more specifically snapshot

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Micah Kornfield
> Thanks Micah, for the latter, I meant the type of denormalization of repeating a 3-part name as opposed to using an ID.
Is the concern here just metadata size or something else? For size I think if this is really anticipated to be a problem that it is likely for the state map in general, a

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Micah, for the latter, I meant the type of denormalization of repeating a 3-part name as opposed to using an ID. On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield wrote: > However, this still does not address the semantic issue which is more >> fundamental in my opinion. The Iceberg table s

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Micah Kornfield
> However, this still does not address the semantic issue which is more fundamental in my opinion. The Iceberg table spec is not aware of catalog table identifiers and this use will be the first break of this abstraction.
IIUC, based on Jan's comments, we are not going to modify the table s

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
That is right. I agree that in the case of using catalog identifiers in state information, using them in lineage information would be a nice-to-have and not a requirement. However, this still does not address the semantic issue which is more fundamental in my opinion. The Iceberg table spec is not

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
Hi Walaa, I would argue that for the refresh operation the query engine has to parse the query and then somehow execute it. For a full refresh it will directly execute the query and for an incremental refresh it will execute a modified version. Therefore it has to fully expand the query tree. Best wis

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Jan for the summary. For this point: > For a refresh operation the query engine has to parse the SQL and fully expand the lineage with its children anyway. So the lineage is not strictly required. If the lineage is provided at creation time by the respective engine, the refresh operatio

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
As the table I created is not properly shown in the mailing list I'll reformat the summary of the different drawbacks again:
Drawbacks of (no lineage, refresh-state key = identifier):
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
Hi, Thanks Micah for clearly stating the requirements. I think this gives better clarity for the discussion. It seems like we don't have a solution that satisfies all requirements at once. So we would need to choose which has the fewest drawbacks. I would like to summarize the different dra

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Walaa Eldin Moustafa
The option of using catalog identifiers in the state map still requires keeping lineage information in the view because REFRESH MV needs the latest fully expanded children (which could have changed from the set of children currently in the state map), without reparsing the view tree. Therefore, cat

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
> I think given the constraint that catalog lookup has to be by identifier and not UUID, I'd prefer using identifier in the refresh state. If we use identifiers, we can directly parallelize the catalog calls to fetch the latest state. If we use UUID, the engine has to go back to the MV an

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
> I do not think 3 and 4 are at odds with each other (for example maintaining both lineage map and state map through UUID can achieve both).
I agree, I should have been more clear that #5 (limiting new view versions) also comes into play. If UUID is used in lineage as part of the view spec,

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
I think given the constraint that catalog lookup has to be by identifier and not UUID, I'd prefer using identifier in the refresh state. If we use identifiers, we can directly parallelize the catalog calls to fetch the latest state. If we use UUID, the engine has to go back to the MV and possibly
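
The point about parallelizing catalog calls can be sketched as follows. This is a minimal illustration, not an Iceberg API: `fetch_current_snapshots` and the `load_snapshot_id` callback are hypothetical names, and the callback stands in for whatever catalog lookup an engine performs by table identifier.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_current_snapshots(
    identifiers: Iterable[str],
    load_snapshot_id: Callable[[str], Optional[int]],
    max_workers: int = 8,
) -> Dict[str, Optional[int]]:
    """Look up the current snapshot ID of every identifier concurrently.

    `load_snapshot_id` is a hypothetical stand-in for a catalog call that
    takes a table identifier and returns its current snapshot ID.
    """
    ids = list(identifiers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each lookup is independent because the key is already the
        # catalog identifier, so no prior resolution step is needed.
        snapshot_ids = list(pool.map(load_snapshot_id, ids))
    return dict(zip(ids, snapshot_ids))


# Usage with an in-memory dict standing in for a catalog.
fake_catalog = {"prod.db.orders": 101, "prod.db.customers": 55}
current = fetch_current_snapshots(fake_catalog.keys(), fake_catalog.get)
```

With UUID keys, an extra step would be needed to map each UUID back to an identifier before these lookups could start, which is the cost described in the message above.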

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Walaa Eldin Moustafa
Thanks Jan, Micah, and Karuppayya for chiming in. I do not think 3 and 4 are at odds with each other (for example maintaining both lineage map and state map through UUID can achieve both). Also, I do not think we can drop the lineage map since in many catalogs, the only lookup method is by the cat

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread karuppayya
+1 to storing the refresh state as a map of UUIDs to snapshot IDs, and deferring the inclusion of lineage to a future iteration (like Micah mentioned). This would greatly simplify the current design. Also in terms of identifiers to use (UUID or catalog identifier) for the refresh state We will not b

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
I think it might be worth restating perceived requirements and making sure there is alignment on them. If I am reading correctly, I think the following are perceived requirements:
1. An engine must be able to unambiguously detect that an underlying queried entity has changed or not via metadata to

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
If we go with either UUID or Table Identifier + VersionID/SnapshotId in the refresh state, then this list is fully expanded already. So, to validate the freshness of a materialization, the engine doesn't even need to look at the view lineage. IMO, the view lineage is nice to have but not a necess

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Jan Kaul
Hi all, I would like to reemphasize the purpose of the refresh-state for materialized views. The purpose is to determine if the precomputed data is fresh, stale or invalid. For that the current snapshot-id of every table in the query tree has to be fetched from the catalog by using its full i
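
The freshness check described above can be sketched roughly as below. This is an illustration under stated assumptions, not the proposed spec: the refresh state is modeled as a plain map from table identifier to the snapshot ID recorded at materialization time, `check_freshness` and `current_snapshot_id` are hypothetical names, and treating a table that can no longer be found as "invalid" is an assumption made only for this example.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"
    INVALID = "invalid"


def check_freshness(
    refresh_state: Dict[str, int],
    current_snapshot_id: Callable[[str], Optional[int]],
) -> Freshness:
    """Compare recorded snapshot IDs against the catalog's current ones.

    `current_snapshot_id` is a hypothetical catalog lookup that returns
    None when the identifier no longer resolves to a table.
    """
    result = Freshness.FRESH
    for identifier, recorded in refresh_state.items():
        current = current_snapshot_id(identifier)
        if current is None:
            # Assumption for this sketch: a missing source table makes
            # the materialization invalid rather than merely stale.
            return Freshness.INVALID
        if current != recorded:
            result = Freshness.STALE
    return result


# Usage with an in-memory dict standing in for a catalog.
fake_catalog = {"db.orders": 10, "db.customers": 7}
state_at_refresh = {"db.orders": 10, "db.customers": 6}
status = check_freshness(state_at_refresh, fake_catalog.get)
```

Note that the check only reads the refresh state and the catalog; it does not consult the view lineage.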

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-14 Thread Walaa Eldin Moustafa
Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref, although we can iterate on the exact JSON structure (e.g., another option is splitting (UUID) state from (UUID + ref) state into two separate higher-level fields). Generally agree on REFRESH VIEW strategy could be up to the

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-14 Thread Benny Chow
I'd like to hear Jan's feedback on using UUID and normalizing the view lineage. I'm on board with this change. I updated the fully spec'd out example using UUID and a normalized view lineage: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.o6yn2lnpxo

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-13 Thread Walaa Eldin Moustafa
Hi Everyone, Just a follow up on this thread. Thanks Benny and Micah for the discussion on the doc [1]. We have been converging more on using UUIDs from the discussion. The only open question was related to UUIDs (of underlying views/tables) being stale upon a REPLACE (or DROP and CREATE) operatio

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-08 Thread Walaa Eldin Moustafa
Thanks Benny! We discussed this option during the meeting but we did not prefer it because we did not want to leak the SQL identifiers to the storage table since SQL identifiers are view concepts and fit better with the view. Thanks, Walaa. On Thu, Aug 8, 2024 at 4:12 PM Benny Chow wrote: > May

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-08 Thread Benny Chow
Maybe a third option is to decouple the view lineage and materialization state. The view lineage can just list out the SQL identifiers+ref... we can still decide whether this is just direct children or fully expanded. The materialization state doesn't have to depend on the view lineage (through ei

[DISCUSS] Materialized Views: Lineage and State information

2024-08-08 Thread Walaa Eldin Moustafa
Hi Everyone, In the last community sync on Materialized Views [1], we agreed to split the information that is used to determine the materialized view staleness into two parts: Lineage Information and State Information. We have made a lot of progress on representing both but one issue remains open: