Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Sun, 10 Nov 2024 21:32:49 -0800

I'm also supportive of Jan's PR + adding the catalog, namespace and table
(all optional) for readability in the refresh state.


Let's keep the big picture in mind that we have achieved a spec that
enables engine A to produce a MV and engine B to consume it.  This can
significantly reduce the cost to run engine B which is a huge win for
Iceberg.

Yes, there are still MV cases which don't work and we have been going back
and forth on them but those can be solved in the future as other parts of
Iceberg evolve.

Thanks
Benny

On Sun, Nov 10, 2024 at 11:32 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Steven. I feel we have not completely agreed on the PR yet. I also
> think it will be a dependency to move forward here. Let me address the
> comments and get further feedback.
>
>
> On Sun, Nov 10, 2024 at 10:46 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> I also feel good about moving forward with Jan's latest description.
>>
>> I would favor including catalog-name (optional), namespace, table name to
>> enhance readability. But either way it is not a blocker for me.
>>
>> Walaa, regarding your PR on view clarification [1], it is good to
>> clarify the expectations. But it can happen async and doesn't have to block
>> the MV given the conclusion on view identifiers.
>>
>> [1] https://github.com/apache/iceberg/pull/11365/
>>
>> On Sat, Nov 9, 2024 at 7:06 AM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> Following the conclusion of the thread on clarifying the View spec [1],
>>> which established that identifiers are clearly defined in the View spec,
>>> I'd like to propose that we resolve this discussion and move forward with
>>> the materialized view spec.
>>>
>>> Even if we stay completely clear of identifiers, we have all the
>>> necessary components to implement materialized views. We can utilize SQL
>>> parsing to obtain catalog identifiers, allowing us to use a refresh-state
>>> that only contains the UUID and snapshot_id/version_ids.
>>>
>>> To enhance readability, we could optionally extend this solution to
>>> include storing the catalog-name (optional), namespace, and name of the
>>> entity in the refresh-state. I don't have a strong opinion about this.
>>>
>>> I'm eager to move the discussion forward and would appreciate any ideas
>>> on how to resolve this. If needed, we can schedule another online meeting
>>> to discuss in person.
>>>
>>> Best regards, Jan
>>>
>>> 1.
>>> https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:view%20spec%20improvements
>>> On 14.10.24 23:31, Walaa Eldin Moustafa wrote:
>>>
>>> On the table side, I would avoid encoding the SQL identifier in the
>>> refresh state at all since they are relevant to views and the view spec.
>>>
>>> On the view side, I would wait till we finalize the expectation from SQL
>>> table identifiers in the view spec.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>
>>>
>>> On Mon, Oct 14, 2024 at 1:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> I am favoring listing all of them in the refresh-state
>>>> - table/view UUID (required)
>>>> - table/view name (required)
>>>> - namespace (required)
>>>> - catalog name (optional)
>>>>
>>>> With catalog alias clarified, engines can use the table identifier
>>>> fields to load the table/view from the catalog. UUID is kept to guard
>>>> against the case table/view being deleted and recreated.
>>>>
>>>>
>>>> On Sun, Oct 13, 2024 at 11:22 PM Jan Kaul <jank...@mailbox.org.invalid>
>>>> <jank...@mailbox.org.invalid> wrote:
>>>>
>>>>> It looks like the discussion about adopting the View Spec concluded
>>>>> that the catalog alias problem is a configuration problem and that the 
>>>>> View
>>>>> spec is clearly defined in terms of identifiers.
>>>>>
>>>>> With that, I would propose to not use any identifiers in the
>>>>> Materialized View Proposal and let the View Spec define how Identifiers 
>>>>> are
>>>>> stored and resolved. I think using UUIDs gives us all the capabilities 
>>>>> that
>>>>> we need and avoids adding additional complexity.
>>>>>
>>>>> Regarding adding the namespace and name to the refresh-state: I
>>>>> changed my mind and think that it would just be easier to stay clear from
>>>>> using any identifiers in the proposal. If you only store namespace and
>>>>> name, you could run into issues where you have collisions of tables from
>>>>> different catalogs. I know that this is just for documentation purposes,
>>>>> but I think I would be cleaner if we leave it out.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Jan
>>>>> On 08.10.24 08:13, Walaa Eldin Moustafa wrote:
>>>>>
>>>>> Here is the thread for reference:
>>>>>
>>>>> https://lists.apache.org/thread/v8m1tpb91g740gmvqyphhjw37mpr8sl7
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Mon, Oct 7, 2024 at 10:41 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I think we have hit a major blocker in the MV spec, and it boils down
>>>>>> to the gaps in the view spec due to its use of SQL identifiers.
>>>>>>
>>>>>> My opinion is to fix the view spec first (regardless of specific MV
>>>>>> requirements) in terms of how it should handle resolving SQL identifiers,
>>>>>> then apply that solution to the MV spec. The path forward will be clear
>>>>>> then, and the solution will be consistent for both MV and non-MV use 
>>>>>> cases.
>>>>>>
>>>>>> Since the issues are pertinent to the view spec itself, I will be
>>>>>> starting a new thread to discuss the view-specific topic. We can get back
>>>>>> to this thread once we resolve the view spec issues.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 7, 2024 at 10:12 AM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm thinking of it slightly differently.
>>>>>>>
>>>>>>> You are allowed to have multiple catalogs (like in spark)
>>>>>>> But you are not allowed to have a view span multiple catalogs, if
>>>>>>> you want something like this use a catalog that supports federation. 
>>>>>>> Even
>>>>>>> in the federated case you would not be required for the federated 
>>>>>>> catalog
>>>>>>> to be your only catalog.
>>>>>>>
>>>>>>> On Mon, Oct 7, 2024 at 12:07 PM Steven Wu <stevenz...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The implicit/template catalog name won't work when the view SQL
>>>>>>>> definition references Iceberg tables from multiple catalogs.
>>>>>>>>
>>>>>>>> If it is only a single catalog, we can even remove the
>>>>>>>> `{this_catalog}` template variable and just assume it is the current
>>>>>>>> catalog of view definition..
>>>>>>>>
>>>>>>>> On Mon, Oct 7, 2024 at 9:31 AM Russell Spitzer <
>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> For Catalog Naming, could we require that all representations use
>>>>>>>>> a template value when we store the SQL? This could at least solve the
>>>>>>>>> problem for inter Catalog views, which would solve everything if we 
>>>>>>>>> have
>>>>>>>>> proper catalog federation in the future. For example, "SELECT * from
>>>>>>>>> catalog.foo.bar" would be represented as something like "SELECT * from
>>>>>>>>> {this_catalog}.foo.bar". Then engines could replace the 
>>>>>>>>> "{this_catalog}" in
>>>>>>>>> their representation with the current identifier assigned to the 
>>>>>>>>> catalog?
>>>>>>>>>
>>>>>>>>> A full example with Spark
>>>>>>>>>
>>>>>>>>> Spark with spark.sql.catalog.foo = ....
>>>>>>>>> creates a view "SELECT * FROM foo.bar.baz"
>>>>>>>>> Which is stored as
>>>>>>>>> "SELECT * from {this_catalog}.bar.baz"
>>>>>>>>>
>>>>>>>>> Then Spark attempting to read from spark.sql.catalog.fuu = .....
>>>>>>>>> Reads "SELECT * FROM fuu.bar.baz"
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would couple this with requiring all representations of the view
>>>>>>>>> using a full identifier and being templated if the engine requires it.
>>>>>>>>>
>>>>>>>>> I apologize if this has already been discussed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 4, 2024 at 7:02 AM Jan Kaul
>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I don't have a strong opinion on adding the namespace and name to
>>>>>>>>>> the refresh-state. Generally, I would be interested in other 
>>>>>>>>>> feedback from
>>>>>>>>>> the community.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>> On 02.10.24 19:19, Steven Wu wrote:
>>>>>>>>>>
>>>>>>>>>> Yeah, my original intention of adding namespace and table name is
>>>>>>>>>> to make the state more human readable, although they are not enough
>>>>>>>>>> as identifiers to load source tables/views from catalog.
>>>>>>>>>>
>>>>>>>>>> Jan and I also brought up the catalog name alias issue in today's
>>>>>>>>>> community sync. We are looking forward to more feedback from the 
>>>>>>>>>> community.
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 2, 2024 at 10:10 AM Benny Chow <btc...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for summarizing Jan!  I agree with all your points.
>>>>>>>>>>> Steven suggested including the namespace and table name as part of 
>>>>>>>>>>> the
>>>>>>>>>>> refresh-state table record.  I think that might be a good addition 
>>>>>>>>>>> too.
>>>>>>>>>>>
>>>>>>>>>>> Pros:
>>>>>>>>>>> - Improves readability of the refresh-state since table names
>>>>>>>>>>> are human consumable whereas UUIDs are opaque.
>>>>>>>>>>> - During refresh state validation, engine could also double
>>>>>>>>>>> check that the table name matches with what the UUID points to.  
>>>>>>>>>>> Basically,
>>>>>>>>>>> helps with UUID validation in case UUID was not generated correctly.
>>>>>>>>>>>
>>>>>>>>>>> Cons:
>>>>>>>>>>> - Namespace + table cannot directly be used by validating engine
>>>>>>>>>>> to query the catalog due to the catalog alias issue.
>>>>>>>>>>> - Namespace + table could have dupes with cross catalog joins so
>>>>>>>>>>> UUID is still required here.
>>>>>>>>>>>
>>>>>>>>>>> Benny
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 1, 2024 at 11:51 PM Jan Kaul
>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Let me try to summarize the current state. With the current
>>>>>>>>>>>> proposal we would introduce the *storage-table* pointer into
>>>>>>>>>>>> the view version and the *refresh-state* into the snapshot
>>>>>>>>>>>> summary. There was the discussion to add a lineage struct to the 
>>>>>>>>>>>> view
>>>>>>>>>>>> version but we would refrain from that. More on that later.
>>>>>>>>>>>>
>>>>>>>>>>>> *Storage-table pointer:*
>>>>>>>>>>>>
>>>>>>>>>>>> Is a partial identifier containing namespace and name of the
>>>>>>>>>>>> referenced storage table. It is assumed that view and storage 
>>>>>>>>>>>> table are in
>>>>>>>>>>>> the same catalog.
>>>>>>>>>>>>
>>>>>>>>>>>> *Refresh-state*:
>>>>>>>>>>>>
>>>>>>>>>>>> The refresh-state is a list of all tables/views referenced in
>>>>>>>>>>>> the view query containing their UUID and either snapshot-id or 
>>>>>>>>>>>> version-id
>>>>>>>>>>>> at the time of the last refresh. It does not contain any part of 
>>>>>>>>>>>> the
>>>>>>>>>>>> identifier like namespace or name.
>>>>>>>>>>>>
>>>>>>>>>>>> You will find a more detailed description in the PR
>>>>>>>>>>>> <https://github.com/apache/iceberg/pull/11041>.
>>>>>>>>>>>>
>>>>>>>>>>>> *Obtaining the identifiers of the Source tables:*
>>>>>>>>>>>>
>>>>>>>>>>>> To determine the freshness of the storage-table, we require the
>>>>>>>>>>>> identifiers of the source tables. There are two possibilities being
>>>>>>>>>>>> discussed.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Parse view query to get identifiers
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Store identifiers in lineage
>>>>>>>>>>>>
>>>>>>>>>>>> As mentioned before, the lineage suffers from the "catalog
>>>>>>>>>>>> alias" problem where different query engines refer to the same 
>>>>>>>>>>>> catalog with
>>>>>>>>>>>> different names. This also applies to the View spec. Hence, storing
>>>>>>>>>>>> identifiers as part of the specification is currently undesirable. 
>>>>>>>>>>>> However,
>>>>>>>>>>>> we could solve the catalog alias problem in the future and then 
>>>>>>>>>>>> adding the
>>>>>>>>>>>> lineage to the specification would work.
>>>>>>>>>>>>
>>>>>>>>>>>> *Regarding backwards compatibility:*
>>>>>>>>>>>>
>>>>>>>>>>>> The main component for determining the freshness is the
>>>>>>>>>>>> refresh-state. It contains the UUIDs and the 
>>>>>>>>>>>> snapshot-ids/version-ids of
>>>>>>>>>>>> the Source tables. The only question is how does the query engine 
>>>>>>>>>>>> obtain
>>>>>>>>>>>> the identifiers. For now the proposal is to use SQL parsing.
>>>>>>>>>>>>
>>>>>>>>>>>> In my opinion introducing the lineage struct in the future does
>>>>>>>>>>>> not break backwards compatibility as query engines can still parse 
>>>>>>>>>>>> the SQL
>>>>>>>>>>>> to obtain the identifiers. Query engines are not forced to use the 
>>>>>>>>>>>> lineage
>>>>>>>>>>>> and can function just as before. However, certain consumers might 
>>>>>>>>>>>> want to
>>>>>>>>>>>> use the lineage as they might not understand the SQL dialect. So 
>>>>>>>>>>>> the
>>>>>>>>>>>> lineage can be seen as an extension that would allow more 
>>>>>>>>>>>> consumers. But it
>>>>>>>>>>>> does not necessarily affect the way a general query engines needs 
>>>>>>>>>>>> to
>>>>>>>>>>>> process a Materialized View.
>>>>>>>>>>>>
>>>>>>>>>>>> With that I find it reasonable to move ahead without the
>>>>>>>>>>>> lineage as part of the view metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>
>>>>>>>>>>>> Jan
>>>>>>>>>>>> On 02.10.24 05:59, Steven Wu wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Let me recap and see if we are on the same page.
>>>>>>>>>>>> 1) we have some consensus on the refresh-state on the storage
>>>>>>>>>>>> table. it would contain these fields: UUID, snapshot-id (for 
>>>>>>>>>>>> table) or
>>>>>>>>>>>> version-id (for view), namespace, table.
>>>>>>>>>>>> 2) there is no consensus if lineage info is needed in the view
>>>>>>>>>>>> definition. we want to defer/postpone that decision.
>>>>>>>>>>>>
>>>>>>>>>>>> I have a little concern on postponing the decision on the
>>>>>>>>>>>> lineage info. It is not a blocker to move forward. But if later we 
>>>>>>>>>>>> decided
>>>>>>>>>>>> to add it to the spec, it poses new requirements for the 
>>>>>>>>>>>> writer/reader.
>>>>>>>>>>>> That would be a breaking change.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 1, 2024 at 8:03 PM Benny Chow <btc...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Jan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Both sound good to me.  (No lineage in views and assumption
>>>>>>>>>>>>> about UUIDs being unique across catalogs).  I hope we get to 
>>>>>>>>>>>>> voting soon on
>>>>>>>>>>>>> your PR..
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul
>>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Benny,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks for bringing up the UUID issue. It is my understanding
>>>>>>>>>>>>>> that UUIDs are designed to be used in distributed systems 
>>>>>>>>>>>>>> without the need
>>>>>>>>>>>>>> for a central coordination process. This is in line with the 
>>>>>>>>>>>>>> description
>>>>>>>>>>>>>> from the RFC4122 document
>>>>>>>>>>>>>> <https://datatracker.ietf.org/doc/html/rfc4122> where it
>>>>>>>>>>>>>> says: "One of the main reasons for using UUIDs is that no 
>>>>>>>>>>>>>> centralized
>>>>>>>>>>>>>> authority is required to administer them".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wikipedia
>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Universally_unique_identifier>
>>>>>>>>>>>>>> says:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "While the probability that a UUID will be duplicated is not
>>>>>>>>>>>>>> zero, it is generally considered close enough to zero to be 
>>>>>>>>>>>>>> negligible"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Consequently, I would argue that it is a reasonable
>>>>>>>>>>>>>> assumption that UUIDs are unique across catalogs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the identifier/catalog-alias problem: As we can
>>>>>>>>>>>>>> fallback to SQL parsing and don't require the lineage, I would 
>>>>>>>>>>>>>> propose to
>>>>>>>>>>>>>> move ahead without the lineage. Especially as this seems to be a 
>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>> with the View Spec that we can't solve now. If there is a demand 
>>>>>>>>>>>>>> to add the
>>>>>>>>>>>>>> lineage in the future, once the catalog-alias problem has been 
>>>>>>>>>>>>>> solved, we
>>>>>>>>>>>>>> can still add it then.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know your thoughts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> On 28.09.24 07:45, Benny Chow wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >>  storing the lineage is an optimization that can avoid
>>>>>>>>>>>>>> recomputation/re-parsing.
>>>>>>>>>>>>>> I don't think having the lineage is optimizing much over
>>>>>>>>>>>>>> re-parsing the SQL.  The most expensive part of SQL parsing is 
>>>>>>>>>>>>>> catalog
>>>>>>>>>>>>>> access which has to happen with lineage anyway.  Once the 
>>>>>>>>>>>>>> planner has the
>>>>>>>>>>>>>> query tree, it can validate the freshness.  It's not like the 
>>>>>>>>>>>>>> planner needs
>>>>>>>>>>>>>> to complete logical and physical planning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >> We could also have the catalog name/alias problem for the
>>>>>>>>>>>>>> same engine.
>>>>>>>>>>>>>> Yes, this is a general problem with the Iceberg View spec.  I
>>>>>>>>>>>>>> guess if two different Spark clusters wanted to share the same 
>>>>>>>>>>>>>> view, they
>>>>>>>>>>>>>> best not reference the catalog name in their SQLs.  Even then, 
>>>>>>>>>>>>>> cross
>>>>>>>>>>>>>> catalog joins are not going to work.  Again, these are problems 
>>>>>>>>>>>>>> with the
>>>>>>>>>>>>>> Iceberg View spec.  I think for the MV spec, as long as we don't 
>>>>>>>>>>>>>> propose
>>>>>>>>>>>>>> something that involves SQL identifiers, then the MV spec isn't 
>>>>>>>>>>>>>> making this
>>>>>>>>>>>>>> different engine problem worse.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There's another issue I'd like to bring up about using UUIDs
>>>>>>>>>>>>>> which is that these UUIDs are client generated and there's no 
>>>>>>>>>>>>>> validation
>>>>>>>>>>>>>> that they are indeed globally unique identifiers.  The catalog 
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>> persists whatever it is given without validating that the UUIDs 
>>>>>>>>>>>>>> are indeed
>>>>>>>>>>>>>> UUIDs and unique across the catalog.  (I know Nessie is not 
>>>>>>>>>>>>>> doing this
>>>>>>>>>>>>>> validation).   We are assuming this UUID is not only unique 
>>>>>>>>>>>>>> within a
>>>>>>>>>>>>>> catalog but is also unique across catalogs.  Thoughts on this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 25, 2024 at 8:01 PM Steven Wu <
>>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree that it is reasonable to assume/restrict view
>>>>>>>>>>>>>>> definition and storage table in the same catalog. Hence the 
>>>>>>>>>>>>>>> storage table
>>>>>>>>>>>>>>> reference in the view metadata can include only namespace and 
>>>>>>>>>>>>>>> table
>>>>>>>>>>>>>>> (excluding the engine dependent catalog name/alias).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding the question of having lineage metadata in view
>>>>>>>>>>>>>>> definition vs re-parsing SQL, I guess storing the lineage is an
>>>>>>>>>>>>>>> optimization that can avoid recomputation/re-parsing. would be 
>>>>>>>>>>>>>>> good to have
>>>>>>>>>>>>>>> more input.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thinking about catalog name/alias again. For the same engine
>>>>>>>>>>>>>>> (like Spark), different applications/jobs may configure the 
>>>>>>>>>>>>>>> catalog name
>>>>>>>>>>>>>>> differently. E.g. Spark catalogs are configured using
>>>>>>>>>>>>>>> properties under spark.sql.catalog.(catalog_name). We could
>>>>>>>>>>>>>>> also have the catalog name/alias problem for the same engine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul
>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Walaa,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It appears that you would like to maintain the lineage
>>>>>>>>>>>>>>>> structure and not revert to parsing the SQL to obtain 
>>>>>>>>>>>>>>>> identifiers.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Initially, one of the reasons for avoiding SQL parsing was
>>>>>>>>>>>>>>>> to enable consumers who don't understand the SQL dialect of any
>>>>>>>>>>>>>>>> representation to determine the freshness of the Materialized 
>>>>>>>>>>>>>>>> View (MV).
>>>>>>>>>>>>>>>> However, with the "catalog alias" issue, having an identifier 
>>>>>>>>>>>>>>>> for some
>>>>>>>>>>>>>>>> representation is insufficient, as the *catalog_name* is
>>>>>>>>>>>>>>>> unlikely to work for the consumer. Therefore, supporting 
>>>>>>>>>>>>>>>> consumers that
>>>>>>>>>>>>>>>> don't use a query engine of any representation seems 
>>>>>>>>>>>>>>>> impossible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Given this, parsing the SQL definition becomes a less
>>>>>>>>>>>>>>>> significant drawback, as the consumer must understand the 
>>>>>>>>>>>>>>>> dialect anyway.
>>>>>>>>>>>>>>>> In fact, simply parsing the SQL definition seems like a more 
>>>>>>>>>>>>>>>> robust and
>>>>>>>>>>>>>>>> straightforward solution than using a lineage for every 
>>>>>>>>>>>>>>>> representation. I
>>>>>>>>>>>>>>>> believe this is why Benny suggested reverting to SQL parsing, 
>>>>>>>>>>>>>>>> and I agree
>>>>>>>>>>>>>>>> with him.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regarding the Storage table identifier: Its design as a
>>>>>>>>>>>>>>>> *PartialIdentifier* with only namespace and name fields
>>>>>>>>>>>>>>>> was intentional, to avoid the *catalog_name* issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>> On 19.09.24 23:16, Benny Chow wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If Spark added the storage table identifier to the MV, I'm
>>>>>>>>>>>>>>>> not sure how it could also add a full identifier to the Dremio
>>>>>>>>>>>>>>>> representation.  Spark doesn't know what name Dremio used for 
>>>>>>>>>>>>>>>> the catalog.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the UX issue, I think Jan cleverly called it a
>>>>>>>>>>>>>>>> "PartialIdentifier" and not a "FullIdentifier" to indicate 
>>>>>>>>>>>>>>>> that catalog
>>>>>>>>>>>>>>>> name is not even a property of the identifier.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Requirement 3 is for the view's SQL.  I'm not sure there is
>>>>>>>>>>>>>>>> a very strong use case to put the storage table into a 
>>>>>>>>>>>>>>>> different catalog
>>>>>>>>>>>>>>>> than the view.  If we had an engine agnostic solution for it, 
>>>>>>>>>>>>>>>> I'm all for
>>>>>>>>>>>>>>>> it though...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think the solution for the storage identifier might be
>>>>>>>>>>>>>>>>> shared with the end state solution for the lineage. One could 
>>>>>>>>>>>>>>>>> imagine a
>>>>>>>>>>>>>>>>> "full identifier" can be used for the storage table; however, 
>>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>>> "representation"-dependent (i.e., it changes according to
>>>>>>>>>>>>>>>>> which representation it is part of, or rather which engine 
>>>>>>>>>>>>>>>>> uses it).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, are we asking engines (or their Iceberg
>>>>>>>>>>>>>>>>> implementation) to throw an exception if the full storage 
>>>>>>>>>>>>>>>>> table identifier
>>>>>>>>>>>>>>>>> was provided as part of the MV definition? Sounds like a not 
>>>>>>>>>>>>>>>>> very ideal UX.
>>>>>>>>>>>>>>>>> Note that it also conflicts with the spirit of requirement #3.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <
>>>>>>>>>>>>>>>>> btc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Jan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "PartialIdentifier" without the catalog name sounds good
>>>>>>>>>>>>>>>>>> to me.  The storage table and MV have to be in the same 
>>>>>>>>>>>>>>>>>> catalog.  That
>>>>>>>>>>>>>>>>>> would be a good fifth requirement to add to the list.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul
>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cool, I guess it's easier to resolve these kind of
>>>>>>>>>>>>>>>>>>> things when talking in person.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I agree with your requirements and the conclusion to use
>>>>>>>>>>>>>>>>>>> a map from UUID to snapshot-id/version-id as the 
>>>>>>>>>>>>>>>>>>> refresh-state, as well as
>>>>>>>>>>>>>>>>>>> dropping the lineage in favor to just re-parsing the SQL 
>>>>>>>>>>>>>>>>>>> query. This gets
>>>>>>>>>>>>>>>>>>> us around the "catalog alias" issue.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And I'm also OK with every engine requiring their own
>>>>>>>>>>>>>>>>>>> representation to use the MV.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There is still the issue with the identifier of the
>>>>>>>>>>>>>>>>>>> storage table and its catalog_name. Should we use an 
>>>>>>>>>>>>>>>>>>> "PartialIdentifier"
>>>>>>>>>>>>>>>>>>> with a namespace and a name field, like so:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     namespace: ["bronze"],
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     name: "lineitem"
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And require the storage table to be in the same catalog
>>>>>>>>>>>>>>>>>>> as the MV itself?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>>>>> On 19.09.24 00:50, Benny Chow wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Steven and I met up yesterday at the Seattle Iceberg
>>>>>>>>>>>>>>>>>>> meetup and we got to talking about the "catalog alias" 
>>>>>>>>>>>>>>>>>>> issue.  He described
>>>>>>>>>>>>>>>>>>> it as an annoying problem =p
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think there are some key requirements we need to
>>>>>>>>>>>>>>>>>>> support:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Different engines can produce and consume shared MVs
>>>>>>>>>>>>>>>>>>> with freshness validation.
>>>>>>>>>>>>>>>>>>> 2. We cannot force different engines to standardize on
>>>>>>>>>>>>>>>>>>> the alias they use for the catalog.
>>>>>>>>>>>>>>>>>>> 3. We cannot force different SQL representations to
>>>>>>>>>>>>>>>>>>> exclude catalog names from table identifiers or not use 
>>>>>>>>>>>>>>>>>>> fully qualified
>>>>>>>>>>>>>>>>>>> table names.
>>>>>>>>>>>>>>>>>>> 4. MV SQL can join tables and views from multiple
>>>>>>>>>>>>>>>>>>> catalogs -> Inevitable with Nessie, Polaris, Unity, Tabular 
>>>>>>>>>>>>>>>>>>> and others...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The producing engine has to save refresh state
>>>>>>>>>>>>>>>>>>> information to let consuming engine know that table X is at 
>>>>>>>>>>>>>>>>>>> what snapshot
>>>>>>>>>>>>>>>>>>> at the time of materialization.  The only way to identify 
>>>>>>>>>>>>>>>>>>> this table across
>>>>>>>>>>>>>>>>>>> different catalog names is to use the cross catalog, 
>>>>>>>>>>>>>>>>>>> globally unique UUID.
>>>>>>>>>>>>>>>>>>> I think our only option is to have the refresh state map 
>>>>>>>>>>>>>>>>>>> UUID to snapshot
>>>>>>>>>>>>>>>>>>> ids and view version ids.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Assuming the above is how we store the refresh state,
>>>>>>>>>>>>>>>>>>> how does the consuming engine determine the current 
>>>>>>>>>>>>>>>>>>> snapshot ids?  The
>>>>>>>>>>>>>>>>>>> consuming engine will have to fully expand the query tree 
>>>>>>>>>>>>>>>>>>> at which point it
>>>>>>>>>>>>>>>>>>> will have the UUIDs as well as the latest snapshot ids/view 
>>>>>>>>>>>>>>>>>>> versions.  This
>>>>>>>>>>>>>>>>>>> can then be diffed against the materialization refresh 
>>>>>>>>>>>>>>>>>>> state to determine
>>>>>>>>>>>>>>>>>>> freshness.  There isn't a need to store the view lineage 
>>>>>>>>>>>>>>>>>>> information to map
>>>>>>>>>>>>>>>>>>> from UUID to the consumer specific identifier so that the 
>>>>>>>>>>>>>>>>>>> consumer can then
>>>>>>>>>>>>>>>>>>> call back into the catalog with that identifier to get the 
>>>>>>>>>>>>>>>>>>> latest state.
>>>>>>>>>>>>>>>>>>> The consuming engine might as well just re-parse the SQL 
>>>>>>>>>>>>>>>>>>> and expand the
>>>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Personally, I'm OK with requiring that an engine must
>>>>>>>>>>>>>>>>>>> have its own SQL representation in order to use the MV.  To 
>>>>>>>>>>>>>>>>>>> me, being able
>>>>>>>>>>>>>>>>>>> to fulfill the key requirements above is much more 
>>>>>>>>>>>>>>>>>>> important.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul
>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to