Re: [DISCUSS] Iceberg Materialzied Views

2024-11-10 Thread Benny Chow
I'm also supportive of Jan's PR + adding the catalog, namespace and table (all optional) for readability in the refresh state. Let's keep the big picture in mind that we have achieved a spec that enables engine A to produce a MV and engine B to consume it. This can significantly reduce the cost t

Re: [DISCUSS] Iceberg Materialzied Views

2024-11-10 Thread Walaa Eldin Moustafa
Thanks Steven. I feel we have not completely agreed on the PR yet. I also think it will be a dependency to move forward here. Let me address the comments and get further feedback. On Sun, Nov 10, 2024 at 10:46 AM Steven Wu wrote: > I also feel good about moving forward with Jan's latest descrip

Re: [DISCUSS] Iceberg Materialzied Views

2024-10-01 Thread Steven Wu
Let me recap and see if we are on the same page. 1) we have some consensus on the refresh-state on the storage table. it would contain these fields: UUID, snapshot-id (for table) or version-id (for view), namespace, table. 2) there is no consensus if lineage info is needed in the view definition. w

Re: [DISCUSS] Iceberg Materialzied Views

2024-10-01 Thread Benny Chow
Hi Jan Both sound good to me. (No lineage in views and assumption about UUIDs being unique across catalogs). I hope we get to voting soon on your PR.. Thanks Benny On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul wrote: > Hi Benny, > > thanks for bringing up the UUID issue. It is my understanding t

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-27 Thread Benny Chow
>> storing the lineage is an optimization that can avoid recomputation/re-parsing. I don't think having the lineage is optimizing much over re-parsing the SQL. The most expensive part of SQL parsing is catalog access which has to happen with lineage anyway. Once the planner has the query tree, i

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-25 Thread Steven Wu
I agree that it is reasonable to assume/restrict view definition and storage table in the same catalog. Hence the storage table reference in the view metadata can include only namespace and table (excluding the engine dependent catalog name/alias). Regarding the question of having lineage metadata

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-20 Thread Jan Kaul
Hi Walaa, It appears that you would like to maintain the lineage structure and not revert to parsing the SQL to obtain identifiers. Initially, one of the reasons for avoiding SQL parsing was to enable consumers who don't understand the SQL dialect of any representation to determine the fresh

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-19 Thread Benny Chow
If Spark added the storage table identifier to the MV, I'm not sure how it could also add a full identifier to the Dremio representation. Spark doesn't know what name Dremio used for the catalog. For the UX issue, I think Jan cleverly called it a "PartialIdentifier" and not a "FullIdentifier" to i

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-19 Thread Walaa Eldin Moustafa
I think the solution for the storage identifier might be shared with the end state solution for the lineage. One could imagine a "full identifier" can be used for the storage table; however, it is "representation"-dependent (i.e., it changes according to which representation it is part of, or rathe

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-19 Thread Benny Chow
Hi Jan "PartialIdentifier" without the catalog name sounds good to me. The storage table and MV have to be in the same catalog. That would be a good fifth requirement to add to the list. Thanks Benny On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul wrote: > Cool, I guess it's easier to resolve these

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-19 Thread Jan Kaul
Cool, I guess it's easier to resolve these kind of things when talking in person. I agree with your requirements and the conclusion to use a map from UUID to snapshot-id/version-id as the refresh-state, as well as dropping the lineage in favor to just re-parsing the SQL query. This gets us aro

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-18 Thread Benny Chow
Steven and I met up yesterday at the Seattle Iceberg meetup and we got to talking about the "catalog alias" issue. He described it as an annoying problem =p I think there are some key requirements we need to support: 1. Different engines can produce and consume shared MVs with freshness validati

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-14 Thread Jan Kaul
How about we make the /catalog_name field/ of the identifier optional? If the field is missing, it references a table/view in the same catalog. If it is present it has to be an engine agnostic catalog name. Shouldn't the catalog_names from the REST catalog spec be engine agnostic? I was wonder

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-13 Thread Benny Chow
The main reason for putting the lineage into the view is so that "another" engine can enumerate out the tables in the view without needing to parse any SQL. But, if we put the lineage under the SQL representation with engine specific catalog names, the "other" engine is not going to be able to use

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-13 Thread Jan Kaul
Hi, regarding our recent discussion on table identifiers with respect to different catalog_names with different query engines. We have the same problem when we want to reference the storage table from the common view. *If we include the catalog_name as part of the identifier, different query

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-11 Thread Walaa Eldin Moustafa
I think this type of discussion is exactly what motivates a clarification in the view spec so that we can resolve MV lineage. Will create separate thread for view spec clarification. Following up on Jan’s point, yes I agree in order to support catalog name, it should be at the representation level

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-11 Thread Jan Kaul
Hi Benny, I think that identifiers only being defined for a certain representation is exactly what we want. Each representation can define their own identifiers that then map to an UUID. This way the "catalog_name" of the identifier for a "Spark" dialect can be different then for a "Dremio" d

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-10 Thread Benny Chow
Hi Walaa, I don't think the current view spec implicitly assumes a common catalog name between engines. I tested this by not specifying the default-catalog and both engines could look up the correct table under the shared default-namespace even when each engine uses a different catalog name. Hi J

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-10 Thread Jan Kaul
Thanks Walaa and Benny for clarifying the problem. I think I have a better understanding now. Sorry for being a bit stubborn before. Wouldn't it make sense then to store the lineage as part of the representation: {     "type": "sql",     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFR

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-08 Thread Benny Chow
Hi Walaa I did some testing with two different engines (Spark and Dremio) against the same Nessie catalog and created the attached materialized view metadata.json. I see your point now about the SQL identifiers being tightly coupled to the engines. In the metadata JSON, spark refers to the catal

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-06 Thread Benny Chow
Hi Steven Yes, I definitely think #2 is easier and cleaner for both reader and writer and that lineage is a separate feature all together. There's no need to couple materialization state with view lineage. The other way to look at helping to decide between the two options is what is the most per

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-03 Thread Walaa Eldin Moustafa
Hi Dan, Thanks for the feedback. For Drop/Recreate, as long as we verify the UUID is still current (i.e., the table in the catalog still maintains the same UUID as in the one in the lineage), we can assume the table is not dropped and recreated. Once it is dropped or recreated, the UUID in the cat

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-03 Thread Jan Kaul
Hi all,I would like to mention that Walaa's document doesn't mention any downsides of Approach #1 in the comparison.Approach #1 has downsides when views are referenced in the materialized view query. These problems arise because only direct children are stored in the lineage. This leads to:- The li

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-03 Thread Daniel Weeks
I'm generally in favor of approach #1 with UUID in lineage. I think it's helpful to know if the underlying table changes (e.g. identifier remains the same, but the table was changed). However, I'm not sure what the behavior would be in that case. Any refresh at that point would not be able to pr

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-03 Thread Walaa Eldin Moustafa
Hi Steven, Thanks! I see the PR now introduces UUID in the state information. That is great progress. I still have slight preference on where to place UUID (in lineage vs state), which I summarized in this doc [1] as promised. I wrote the doc before UUID was added in the PR, so it still compares w

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-03 Thread Steven Wu
Walaa, for the listed discussion points, how should we move forward? should we have another MV sync meeting? BTW, Jan's latest spec PR addressed my comment on UUID. On Mon, Sep 2, 2024 at 4:35 PM Walaa Eldin Moustafa wrote: > Hi Jan, > > I think we need further discussion for a few reasons: > >

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-02 Thread Walaa Eldin Moustafa
Hi Jan, I think we need further discussion for a few reasons: * Semantic issues are significant, especially in a spec. They could create gaps that are very hard to fix in the future. I would rather spend more time designing them properly than fix the gaps with band-aid solutions in the future. *

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-02 Thread Jan Kaul
Hi Walaa, in my opinion the UUID vs table identifier discussion comes down to the question: do we introduce table identifiers in an opaque struct in the table snapshot summary or do we accept technical drawbacks? Using UUIDs with a lineage struct in the view metadata leads to one of the foll

Re: [DISCUSS] Iceberg Materialzied Views

2024-08-29 Thread Walaa Eldin Moustafa
Hi Jan, I think we need to close the discussion on the UUID vs table identifier options and possibly cast a vote before having a productive discussion on the PR. I did not get a chance yet to post the document on the UUID vs table identifier discussion. I will do that by next week. Thanks, Walaa.

[DISCUSS] Iceberg Materialzied Views

2024-08-29 Thread Jan Kaul
Hi all, to move the Iceberg Materialzied View Proposal forward, I created a PR (https://github.com/apache/iceberg/pull/11041) that adds a section on Materialized Views to the View Spec. I hope we can resolve any remaining questions there, before we can start the voting process for the Proposal