Re: [DISCUSS] Iceberg Materialzied Views

Walaa Eldin Moustafa Sun, 10 Nov 2024 11:32:56 -0800

Thanks Steven. I feel we have not completely agreed on the PR yet. I also
think it will be a dependency to move forward here. Let me address the
comments and get further feedback.



On Sun, Nov 10, 2024 at 10:46 AM Steven Wu <stevenz...@gmail.com> wrote:

> I also feel good about moving forward with Jan's latest description.
>
> I would favor including catalog-name (optional), namespace, table name to
> enhance readability. But either way it is not a blocker for me.
>
> Walaa, regarding your PR on view clarification [1], it is good to
> clarify the expectations. But it can happen async and doesn't have to block
> the MV given the conclusion on view identifiers.
>
> [1] https://github.com/apache/iceberg/pull/11365/
>
> On Sat, Nov 9, 2024 at 7:06 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
>
>> Hello everyone,
>>
>> Following the conclusion of the thread on clarifying the View spec [1],
>> which established that identifiers are clearly defined in the View spec,
>> I'd like to propose that we resolve this discussion and move forward with
>> the materialized view spec.
>>
>> Even if we stay completely clear of identifiers, we have all the
>> necessary components to implement materialized views. We can utilize SQL
>> parsing to obtain catalog identifiers, allowing us to use a refresh-state
>> that only contains the UUID and snapshot_id/version_ids.
>>
>> To enhance readability, we could optionally extend this solution to
>> include storing the catalog-name (optional), namespace, and name of the
>> entity in the refresh-state. I don't have a strong opinion about this.
>>
>> I'm eager to move the discussion forward and would appreciate any ideas
>> on how to resolve this. If needed, we can schedule another online meeting
>> to discuss in person.
>>
>> Best regards, Jan
>>
>> 1.
>> https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:view%20spec%20improvements
>> On 14.10.24 23:31, Walaa Eldin Moustafa wrote:
>>
>> On the table side, I would avoid encoding the SQL identifier in the
>> refresh state at all since they are relevant to views and the view spec.
>>
>> On the view side, I would wait till we finalize the expectation from SQL
>> table identifiers in the view spec.
>>
>> Thanks,
>> Walaa.
>>
>>
>>
>>
>> On Mon, Oct 14, 2024 at 1:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> I am favoring listing all of them in the refresh-state
>>> - table/view UUID (required)
>>> - table/view name (required)
>>> - namespace (required)
>>> - catalog name (optional)
>>>
>>> With catalog alias clarified, engines can use the table identifier
>>> fields to load the table/view from the catalog. UUID is kept to guard
>>> against the case table/view being deleted and recreated.
>>>
>>>
>>> On Sun, Oct 13, 2024 at 11:22 PM Jan Kaul <jank...@mailbox.org.invalid>
>>> <jank...@mailbox.org.invalid> wrote:
>>>
>>>> It looks like the discussion about adopting the View Spec concluded
>>>> that the catalog alias problem is a configuration problem and that the View
>>>> spec is clearly defined in terms of identifiers.
>>>>
>>>> With that, I would propose to not use any identifiers in the
>>>> Materialized View Proposal and let the View Spec define how Identifiers are
>>>> stored and resolved. I think using UUIDs gives us all the capabilities that
>>>> we need and avoids adding additional complexity.
>>>>
>>>> Regarding adding the namespace and name to the refresh-state: I changed
>>>> my mind and think that it would just be easier to stay clear from using any
>>>> identifiers in the proposal. If you only store namespace and name, you
>>>> could run into issues where you have collisions of tables from different
>>>> catalogs. I know that this is just for documentation purposes, but I think
>>>> I would be cleaner if we leave it out.
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>> On 08.10.24 08:13, Walaa Eldin Moustafa wrote:
>>>>
>>>> Here is the thread for reference:
>>>>
>>>> https://lists.apache.org/thread/v8m1tpb91g740gmvqyphhjw37mpr8sl7
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Mon, Oct 7, 2024 at 10:41 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I think we have hit a major blocker in the MV spec, and it boils down
>>>>> to the gaps in the view spec due to its use of SQL identifiers.
>>>>>
>>>>> My opinion is to fix the view spec first (regardless of specific MV
>>>>> requirements) in terms of how it should handle resolving SQL identifiers,
>>>>> then apply that solution to the MV spec. The path forward will be clear
>>>>> then, and the solution will be consistent for both MV and non-MV use 
>>>>> cases.
>>>>>
>>>>> Since the issues are pertinent to the view spec itself, I will be
>>>>> starting a new thread to discuss the view-specific topic. We can get back
>>>>> to this thread once we resolve the view spec issues.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Mon, Oct 7, 2024 at 10:12 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I'm thinking of it slightly differently.
>>>>>>
>>>>>> You are allowed to have multiple catalogs (like in spark)
>>>>>> But you are not allowed to have a view span multiple catalogs, if you
>>>>>> want something like this use a catalog that supports federation. Even in
>>>>>> the federated case you would not be required for the federated catalog to
>>>>>> be your only catalog.
>>>>>>
>>>>>> On Mon, Oct 7, 2024 at 12:07 PM Steven Wu <stevenz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The implicit/template catalog name won't work when the view SQL
>>>>>>> definition references Iceberg tables from multiple catalogs.
>>>>>>>
>>>>>>> If it is only a single catalog, we can even remove the
>>>>>>> `{this_catalog}` template variable and just assume it is the current
>>>>>>> catalog of view definition..
>>>>>>>
>>>>>>> On Mon, Oct 7, 2024 at 9:31 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> For Catalog Naming, could we require that all representations use a
>>>>>>>> template value when we store the SQL? This could at least solve the 
>>>>>>>> problem
>>>>>>>> for inter Catalog views, which would solve everything if we have proper
>>>>>>>> catalog federation in the future. For example, "SELECT * from
>>>>>>>> catalog.foo.bar" would be represented as something like "SELECT * from
>>>>>>>> {this_catalog}.foo.bar". Then engines could replace the 
>>>>>>>> "{this_catalog}" in
>>>>>>>> their representation with the current identifier assigned to the 
>>>>>>>> catalog?
>>>>>>>>
>>>>>>>> A full example with Spark
>>>>>>>>
>>>>>>>> Spark with spark.sql.catalog.foo = ....
>>>>>>>> creates a view "SELECT * FROM foo.bar.baz"
>>>>>>>> Which is stored as
>>>>>>>> "SELECT * from {this_catalog}.bar.baz"
>>>>>>>>
>>>>>>>> Then Spark attempting to read from spark.sql.catalog.fuu = .....
>>>>>>>> Reads "SELECT * FROM fuu.bar.baz"
>>>>>>>>
>>>>>>>>
>>>>>>>> I would couple this with requiring all representations of the view
>>>>>>>> using a full identifier and being templated if the engine requires it.
>>>>>>>>
>>>>>>>> I apologize if this has already been discussed
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 4, 2024 at 7:02 AM Jan Kaul
>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>
>>>>>>>>> I don't have a strong opinion on adding the namespace and name to
>>>>>>>>> the refresh-state. Generally, I would be interested in other feedback 
>>>>>>>>> from
>>>>>>>>> the community.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>> On 02.10.24 19:19, Steven Wu wrote:
>>>>>>>>>
>>>>>>>>> Yeah, my original intention of adding namespace and table name is
>>>>>>>>> to make the state more human readable, although they are not enough
>>>>>>>>> as identifiers to load source tables/views from catalog.
>>>>>>>>>
>>>>>>>>> Jan and I also brought up the catalog name alias issue in today's
>>>>>>>>> community sync. We are looking forward to more feedback from the 
>>>>>>>>> community.
>>>>>>>>>
>>>>>>>>> On Wed, Oct 2, 2024 at 10:10 AM Benny Chow <btc...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for summarizing Jan!  I agree with all your points.
>>>>>>>>>> Steven suggested including the namespace and table name as part of 
>>>>>>>>>> the
>>>>>>>>>> refresh-state table record.  I think that might be a good addition 
>>>>>>>>>> too.
>>>>>>>>>>
>>>>>>>>>> Pros:
>>>>>>>>>> - Improves readability of the refresh-state since table names are
>>>>>>>>>> human consumable whereas UUIDs are opaque.
>>>>>>>>>> - During refresh state validation, engine could also double check
>>>>>>>>>> that the table name matches with what the UUID points to.  
>>>>>>>>>> Basically, helps
>>>>>>>>>> with UUID validation in case UUID was not generated correctly.
>>>>>>>>>>
>>>>>>>>>> Cons:
>>>>>>>>>> - Namespace + table cannot directly be used by validating engine
>>>>>>>>>> to query the catalog due to the catalog alias issue.
>>>>>>>>>> - Namespace + table could have dupes with cross catalog joins so
>>>>>>>>>> UUID is still required here.
>>>>>>>>>>
>>>>>>>>>> Benny
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 1, 2024 at 11:51 PM Jan Kaul
>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Let me try to summarize the current state. With the current
>>>>>>>>>>> proposal we would introduce the *storage-table* pointer into
>>>>>>>>>>> the view version and the *refresh-state* into the snapshot
>>>>>>>>>>> summary. There was the discussion to add a lineage struct to the 
>>>>>>>>>>> view
>>>>>>>>>>> version but we would refrain from that. More on that later.
>>>>>>>>>>>
>>>>>>>>>>> *Storage-table pointer:*
>>>>>>>>>>>
>>>>>>>>>>> Is a partial identifier containing namespace and name of the
>>>>>>>>>>> referenced storage table. It is assumed that view and storage table 
>>>>>>>>>>> are in
>>>>>>>>>>> the same catalog.
>>>>>>>>>>>
>>>>>>>>>>> *Refresh-state*:
>>>>>>>>>>>
>>>>>>>>>>> The refresh-state is a list of all tables/views referenced in
>>>>>>>>>>> the view query containing their UUID and either snapshot-id or 
>>>>>>>>>>> version-id
>>>>>>>>>>> at the time of the last refresh. It does not contain any part of the
>>>>>>>>>>> identifier like namespace or name.
>>>>>>>>>>>
>>>>>>>>>>> You will find a more detailed description in the PR
>>>>>>>>>>> <https://github.com/apache/iceberg/pull/11041>.
>>>>>>>>>>>
>>>>>>>>>>> *Obtaining the identifiers of the Source tables:*
>>>>>>>>>>>
>>>>>>>>>>> To determine the freshness of the storage-table, we require the
>>>>>>>>>>> identifiers of the source tables. There are two possibilities being
>>>>>>>>>>> discussed.
>>>>>>>>>>>
>>>>>>>>>>> 1. Parse view query to get identifiers
>>>>>>>>>>>
>>>>>>>>>>> 2. Store identifiers in lineage
>>>>>>>>>>>
>>>>>>>>>>> As mentioned before, the lineage suffers from the "catalog
>>>>>>>>>>> alias" problem where different query engines refer to the same 
>>>>>>>>>>> catalog with
>>>>>>>>>>> different names. This also applies to the View spec. Hence, storing
>>>>>>>>>>> identifiers as part of the specification is currently undesirable. 
>>>>>>>>>>> However,
>>>>>>>>>>> we could solve the catalog alias problem in the future and then 
>>>>>>>>>>> adding the
>>>>>>>>>>> lineage to the specification would work.
>>>>>>>>>>>
>>>>>>>>>>> *Regarding backwards compatibility:*
>>>>>>>>>>>
>>>>>>>>>>> The main component for determining the freshness is the
>>>>>>>>>>> refresh-state. It contains the UUIDs and the 
>>>>>>>>>>> snapshot-ids/version-ids of
>>>>>>>>>>> the Source tables. The only question is how does the query engine 
>>>>>>>>>>> obtain
>>>>>>>>>>> the identifiers. For now the proposal is to use SQL parsing.
>>>>>>>>>>>
>>>>>>>>>>> In my opinion introducing the lineage struct in the future does
>>>>>>>>>>> not break backwards compatibility as query engines can still parse 
>>>>>>>>>>> the SQL
>>>>>>>>>>> to obtain the identifiers. Query engines are not forced to use the 
>>>>>>>>>>> lineage
>>>>>>>>>>> and can function just as before. However, certain consumers might 
>>>>>>>>>>> want to
>>>>>>>>>>> use the lineage as they might not understand the SQL dialect. So the
>>>>>>>>>>> lineage can be seen as an extension that would allow more 
>>>>>>>>>>> consumers. But it
>>>>>>>>>>> does not necessarily affect the way a general query engines needs to
>>>>>>>>>>> process a Materialized View.
>>>>>>>>>>>
>>>>>>>>>>> With that I find it reasonable to move ahead without the lineage
>>>>>>>>>>> as part of the view metadata.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes,
>>>>>>>>>>>
>>>>>>>>>>> Jan
>>>>>>>>>>> On 02.10.24 05:59, Steven Wu wrote:
>>>>>>>>>>>
>>>>>>>>>>> Let me recap and see if we are on the same page.
>>>>>>>>>>> 1) we have some consensus on the refresh-state on the storage
>>>>>>>>>>> table. it would contain these fields: UUID, snapshot-id (for table) 
>>>>>>>>>>> or
>>>>>>>>>>> version-id (for view), namespace, table.
>>>>>>>>>>> 2) there is no consensus if lineage info is needed in the view
>>>>>>>>>>> definition. we want to defer/postpone that decision.
>>>>>>>>>>>
>>>>>>>>>>> I have a little concern on postponing the decision on the
>>>>>>>>>>> lineage info. It is not a blocker to move forward. But if later we 
>>>>>>>>>>> decided
>>>>>>>>>>> to add it to the spec, it poses new requirements for the 
>>>>>>>>>>> writer/reader.
>>>>>>>>>>> That would be a breaking change.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 1, 2024 at 8:03 PM Benny Chow <btc...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Jan
>>>>>>>>>>>>
>>>>>>>>>>>> Both sound good to me.  (No lineage in views and assumption
>>>>>>>>>>>> about UUIDs being unique across catalogs).  I hope we get to 
>>>>>>>>>>>> voting soon on
>>>>>>>>>>>> your PR..
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Benny
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul
>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Benny,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks for bringing up the UUID issue. It is my understanding
>>>>>>>>>>>>> that UUIDs are designed to be used in distributed systems without 
>>>>>>>>>>>>> the need
>>>>>>>>>>>>> for a central coordination process. This is in line with the 
>>>>>>>>>>>>> description
>>>>>>>>>>>>> from the RFC4122 document
>>>>>>>>>>>>> <https://datatracker.ietf.org/doc/html/rfc4122> where it
>>>>>>>>>>>>> says: "One of the main reasons for using UUIDs is that no 
>>>>>>>>>>>>> centralized
>>>>>>>>>>>>> authority is required to administer them".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wikipedia
>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Universally_unique_identifier>
>>>>>>>>>>>>> says:
>>>>>>>>>>>>>
>>>>>>>>>>>>> "While the probability that a UUID will be duplicated is not
>>>>>>>>>>>>> zero, it is generally considered close enough to zero to be 
>>>>>>>>>>>>> negligible"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Consequently, I would argue that it is a reasonable assumption
>>>>>>>>>>>>> that UUIDs are unique across catalogs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the identifier/catalog-alias problem: As we can
>>>>>>>>>>>>> fallback to SQL parsing and don't require the lineage, I would 
>>>>>>>>>>>>> propose to
>>>>>>>>>>>>> move ahead without the lineage. Especially as this seems to be a 
>>>>>>>>>>>>> problem
>>>>>>>>>>>>> with the View Spec that we can't solve now. If there is a demand 
>>>>>>>>>>>>> to add the
>>>>>>>>>>>>> lineage in the future, once the catalog-alias problem has been 
>>>>>>>>>>>>> solved, we
>>>>>>>>>>>>> can still add it then.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know your thoughts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>> On 28.09.24 07:45, Benny Chow wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> >>  storing the lineage is an optimization that can avoid
>>>>>>>>>>>>> recomputation/re-parsing.
>>>>>>>>>>>>> I don't think having the lineage is optimizing much over
>>>>>>>>>>>>> re-parsing the SQL.  The most expensive part of SQL parsing is 
>>>>>>>>>>>>> catalog
>>>>>>>>>>>>> access which has to happen with lineage anyway.  Once the planner 
>>>>>>>>>>>>> has the
>>>>>>>>>>>>> query tree, it can validate the freshness.  It's not like the 
>>>>>>>>>>>>> planner needs
>>>>>>>>>>>>> to complete logical and physical planning.
>>>>>>>>>>>>>
>>>>>>>>>>>>> >> We could also have the catalog name/alias problem for the
>>>>>>>>>>>>> same engine.
>>>>>>>>>>>>> Yes, this is a general problem with the Iceberg View spec.  I
>>>>>>>>>>>>> guess if two different Spark clusters wanted to share the same 
>>>>>>>>>>>>> view, they
>>>>>>>>>>>>> best not reference the catalog name in their SQLs.  Even then, 
>>>>>>>>>>>>> cross
>>>>>>>>>>>>> catalog joins are not going to work.  Again, these are problems 
>>>>>>>>>>>>> with the
>>>>>>>>>>>>> Iceberg View spec.  I think for the MV spec, as long as we don't 
>>>>>>>>>>>>> propose
>>>>>>>>>>>>> something that involves SQL identifiers, then the MV spec isn't 
>>>>>>>>>>>>> making this
>>>>>>>>>>>>> different engine problem worse.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There's another issue I'd like to bring up about using UUIDs
>>>>>>>>>>>>> which is that these UUIDs are client generated and there's no 
>>>>>>>>>>>>> validation
>>>>>>>>>>>>> that they are indeed globally unique identifiers.  The catalog 
>>>>>>>>>>>>> just
>>>>>>>>>>>>> persists whatever it is given without validating that the UUIDs 
>>>>>>>>>>>>> are indeed
>>>>>>>>>>>>> UUIDs and unique across the catalog.  (I know Nessie is not doing 
>>>>>>>>>>>>> this
>>>>>>>>>>>>> validation).   We are assuming this UUID is not only unique 
>>>>>>>>>>>>> within a
>>>>>>>>>>>>> catalog but is also unique across catalogs.  Thoughts on this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 25, 2024 at 8:01 PM Steven Wu <
>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree that it is reasonable to assume/restrict view
>>>>>>>>>>>>>> definition and storage table in the same catalog. Hence the 
>>>>>>>>>>>>>> storage table
>>>>>>>>>>>>>> reference in the view metadata can include only namespace and 
>>>>>>>>>>>>>> table
>>>>>>>>>>>>>> (excluding the engine dependent catalog name/alias).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the question of having lineage metadata in view
>>>>>>>>>>>>>> definition vs re-parsing SQL, I guess storing the lineage is an
>>>>>>>>>>>>>> optimization that can avoid recomputation/re-parsing. would be 
>>>>>>>>>>>>>> good to have
>>>>>>>>>>>>>> more input.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thinking about catalog name/alias again. For the same engine
>>>>>>>>>>>>>> (like Spark), different applications/jobs may configure the 
>>>>>>>>>>>>>> catalog name
>>>>>>>>>>>>>> differently. E.g. Spark catalogs are configured using
>>>>>>>>>>>>>> properties under spark.sql.catalog.(catalog_name). We could
>>>>>>>>>>>>>> also have the catalog name/alias problem for the same engine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul
>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Walaa,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It appears that you would like to maintain the lineage
>>>>>>>>>>>>>>> structure and not revert to parsing the SQL to obtain 
>>>>>>>>>>>>>>> identifiers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Initially, one of the reasons for avoiding SQL parsing was
>>>>>>>>>>>>>>> to enable consumers who don't understand the SQL dialect of any
>>>>>>>>>>>>>>> representation to determine the freshness of the Materialized 
>>>>>>>>>>>>>>> View (MV).
>>>>>>>>>>>>>>> However, with the "catalog alias" issue, having an identifier 
>>>>>>>>>>>>>>> for some
>>>>>>>>>>>>>>> representation is insufficient, as the *catalog_name* is
>>>>>>>>>>>>>>> unlikely to work for the consumer. Therefore, supporting 
>>>>>>>>>>>>>>> consumers that
>>>>>>>>>>>>>>> don't use a query engine of any representation seems impossible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given this, parsing the SQL definition becomes a less
>>>>>>>>>>>>>>> significant drawback, as the consumer must understand the 
>>>>>>>>>>>>>>> dialect anyway.
>>>>>>>>>>>>>>> In fact, simply parsing the SQL definition seems like a more 
>>>>>>>>>>>>>>> robust and
>>>>>>>>>>>>>>> straightforward solution than using a lineage for every 
>>>>>>>>>>>>>>> representation. I
>>>>>>>>>>>>>>> believe this is why Benny suggested reverting to SQL parsing, 
>>>>>>>>>>>>>>> and I agree
>>>>>>>>>>>>>>> with him.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding the Storage table identifier: Its design as a
>>>>>>>>>>>>>>> *PartialIdentifier* with only namespace and name fields was
>>>>>>>>>>>>>>> intentional, to avoid the *catalog_name* issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>> On 19.09.24 23:16, Benny Chow wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If Spark added the storage table identifier to the MV, I'm
>>>>>>>>>>>>>>> not sure how it could also add a full identifier to the Dremio
>>>>>>>>>>>>>>> representation.  Spark doesn't know what name Dremio used for 
>>>>>>>>>>>>>>> the catalog.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For the UX issue, I think Jan cleverly called it a
>>>>>>>>>>>>>>> "PartialIdentifier" and not a "FullIdentifier" to indicate that 
>>>>>>>>>>>>>>> catalog
>>>>>>>>>>>>>>> name is not even a property of the identifier.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Requirement 3 is for the view's SQL.  I'm not sure there is
>>>>>>>>>>>>>>> a very strong use case to put the storage table into a 
>>>>>>>>>>>>>>> different catalog
>>>>>>>>>>>>>>> than the view.  If we had an engine agnostic solution for it, 
>>>>>>>>>>>>>>> I'm all for
>>>>>>>>>>>>>>> it though...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think the solution for the storage identifier might be
>>>>>>>>>>>>>>>> shared with the end state solution for the lineage. One could 
>>>>>>>>>>>>>>>> imagine a
>>>>>>>>>>>>>>>> "full identifier" can be used for the storage table; however, 
>>>>>>>>>>>>>>>> it is
>>>>>>>>>>>>>>>> "representation"-dependent (i.e., it changes according to
>>>>>>>>>>>>>>>> which representation it is part of, or rather which engine 
>>>>>>>>>>>>>>>> uses it).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, are we asking engines (or their Iceberg
>>>>>>>>>>>>>>>> implementation) to throw an exception if the full storage 
>>>>>>>>>>>>>>>> table identifier
>>>>>>>>>>>>>>>> was provided as part of the MV definition? Sounds like a not 
>>>>>>>>>>>>>>>> very ideal UX.
>>>>>>>>>>>>>>>> Note that it also conflicts with the spirit of requirement #3.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <
>>>>>>>>>>>>>>>> btc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Jan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> "PartialIdentifier" without the catalog name sounds good
>>>>>>>>>>>>>>>>> to me.  The storage table and MV have to be in the same 
>>>>>>>>>>>>>>>>> catalog.  That
>>>>>>>>>>>>>>>>> would be a good fifth requirement to add to the list.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul
>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cool, I guess it's easier to resolve these kind of things
>>>>>>>>>>>>>>>>>> when talking in person.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree with your requirements and the conclusion to use
>>>>>>>>>>>>>>>>>> a map from UUID to snapshot-id/version-id as the 
>>>>>>>>>>>>>>>>>> refresh-state, as well as
>>>>>>>>>>>>>>>>>> dropping the lineage in favor to just re-parsing the SQL 
>>>>>>>>>>>>>>>>>> query. This gets
>>>>>>>>>>>>>>>>>> us around the "catalog alias" issue.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And I'm also OK with every engine requiring their own
>>>>>>>>>>>>>>>>>> representation to use the MV.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There is still the issue with the identifier of the
>>>>>>>>>>>>>>>>>> storage table and its catalog_name. Should we use an 
>>>>>>>>>>>>>>>>>> "PartialIdentifier"
>>>>>>>>>>>>>>>>>> with a namespace and a name field, like so:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     namespace: ["bronze"],
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     name: "lineitem"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And require the storage table to be in the same catalog
>>>>>>>>>>>>>>>>>> as the MV itself?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>>>>> On 19.09.24 00:50, Benny Chow wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Steven and I met up yesterday at the Seattle Iceberg
>>>>>>>>>>>>>>>>>> meetup and we got to talking about the "catalog alias" 
>>>>>>>>>>>>>>>>>> issue.  He described
>>>>>>>>>>>>>>>>>> it as an annoying problem =p
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think there are some key requirements we need to
>>>>>>>>>>>>>>>>>> support:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Different engines can produce and consume shared MVs
>>>>>>>>>>>>>>>>>> with freshness validation.
>>>>>>>>>>>>>>>>>> 2. We cannot force different engines to standardize on
>>>>>>>>>>>>>>>>>> the alias they use for the catalog.
>>>>>>>>>>>>>>>>>> 3. We cannot force different SQL representations to
>>>>>>>>>>>>>>>>>> exclude catalog names from table identifiers or not use 
>>>>>>>>>>>>>>>>>> fully qualified
>>>>>>>>>>>>>>>>>> table names.
>>>>>>>>>>>>>>>>>> 4. MV SQL can join tables and views from multiple
>>>>>>>>>>>>>>>>>> catalogs -> Inevitable with Nessie, Polaris, Unity, Tabular 
>>>>>>>>>>>>>>>>>> and others...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The producing engine has to save refresh state
>>>>>>>>>>>>>>>>>> information to let consuming engine know that table X is at 
>>>>>>>>>>>>>>>>>> what snapshot
>>>>>>>>>>>>>>>>>> at the time of materialization.  The only way to identify 
>>>>>>>>>>>>>>>>>> this table across
>>>>>>>>>>>>>>>>>> different catalog names is to use the cross catalog, 
>>>>>>>>>>>>>>>>>> globally unique UUID.
>>>>>>>>>>>>>>>>>> I think our only option is to have the refresh state map 
>>>>>>>>>>>>>>>>>> UUID to snapshot
>>>>>>>>>>>>>>>>>> ids and view version ids.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Assuming the above is how we store the refresh state, how
>>>>>>>>>>>>>>>>>> does the consuming engine determine the current snapshot 
>>>>>>>>>>>>>>>>>> ids?  The
>>>>>>>>>>>>>>>>>> consuming engine will have to fully expand the query tree at 
>>>>>>>>>>>>>>>>>> which point it
>>>>>>>>>>>>>>>>>> will have the UUIDs as well as the latest snapshot ids/view 
>>>>>>>>>>>>>>>>>> versions.  This
>>>>>>>>>>>>>>>>>> can then be diffed against the materialization refresh state 
>>>>>>>>>>>>>>>>>> to determine
>>>>>>>>>>>>>>>>>> freshness.  There isn't a need to store the view lineage 
>>>>>>>>>>>>>>>>>> information to map
>>>>>>>>>>>>>>>>>> from UUID to the consumer specific identifier so that the 
>>>>>>>>>>>>>>>>>> consumer can then
>>>>>>>>>>>>>>>>>> call back into the catalog with that identifier to get the 
>>>>>>>>>>>>>>>>>> latest state.
>>>>>>>>>>>>>>>>>> The consuming engine might as well just re-parse the SQL and 
>>>>>>>>>>>>>>>>>> expand the
>>>>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Personally, I'm OK with requiring that an engine must
>>>>>>>>>>>>>>>>>> have its own SQL representation in order to use the MV.  To 
>>>>>>>>>>>>>>>>>> me, being able
>>>>>>>>>>>>>>>>>> to fulfill the key requirements above is much more important.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Benny
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul
>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to