Re: [DISCUSS] Iceberg Materialized Views

Steven Wu Tue, 01 Oct 2024 21:00:50 -0700

Let me recap and see if we are on the same page.
1) We have some consensus on the refresh-state on the storage table. It
would contain these fields: UUID, snapshot-id (for a table) or version-id
(for a view), namespace, table.
2) There is no consensus on whether lineage info is needed in the view
definition. We want to defer that decision.
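
To make point 1 concrete, here is a rough sketch of what a refresh-state
entry and the freshness check discussed further down the thread (diffing
the stored state against the state found when expanding the query tree)
might look like. This is only an illustration, not the proposed spec: the
class name, the field names and the is_fresh helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourceState:
    """One refresh-state entry. All names here are hypothetical."""
    uuid: str                          # table/view UUID, assumed unique across catalogs
    namespace: Tuple[str, ...]
    table: str
    snapshot_id: Optional[int] = None  # set when the source is a table
    version_id: Optional[int] = None   # set when the source is a view

    def position(self):
        # The state the source was in: snapshot for tables, version for views.
        return (self.snapshot_id, self.version_id)


def is_fresh(refresh_state, current_state):
    """True if the sources seen now match the recorded refresh state, keyed by UUID."""
    recorded = {s.uuid: s.position() for s in refresh_state}
    current = {s.uuid: s.position() for s in current_state}
    return recorded == current


# State recorded at materialization time vs. state seen by the consumer.
uuid = "fa6506c3-7681-40c8-86dc-e36561f83385"
recorded = [SourceState(uuid, ("bronze",), "lineitem", snapshot_id=10)]
current = [SourceState(uuid, ("bronze",), "lineitem", snapshot_id=11)]
print(is_fresh(recorded, current))  # prints False: the table has a newer snapshot
```

The comparison is keyed on the UUID only, so the namespace and table fields
do not affect the result; they are carried along purely as the fields listed
in the recap.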


I have a slight concern about postponing the decision on the lineage info. It
is not a blocker to moving forward. But if we later decide to add it to the
spec, it would pose new requirements for writers and readers. That would be a
breaking change.



On Tue, Oct 1, 2024 at 8:03 PM Benny Chow <[email protected]> wrote:

> Hi Jan
>
> Both sound good to me (no lineage in views, and the assumption that UUIDs
> are unique across catalogs).  I hope we get to voting on your PR soon.
>
> Thanks
> Benny
>
> On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul <[email protected]>
> wrote:
>
>> Hi Benny,
>>
>> thanks for bringing up the UUID issue. It is my understanding that UUIDs
>> are designed to be used in distributed systems without the need for a
>> central coordination process. This is in line with the description from the
>> RFC4122 document <https://datatracker.ietf.org/doc/html/rfc4122> where it says:
>> "One of the main reasons for using UUIDs is that no centralized authority
>> is required to administer them".
>>
>> Wikipedia <https://en.wikipedia.org/wiki/Universally_unique_identifier>
>> says:
>>
>> "While the probability that a UUID will be duplicated is not zero, it is
>> generally considered close enough to zero to be negligible"
>>
>> Consequently, I would argue that it is a reasonable assumption that UUIDs
>> are unique across catalogs.
>>
>>
>> Regarding the identifier/catalog-alias problem: As we can fall back to SQL
>> parsing and don't require the lineage, I would propose to move ahead
>> without the lineage. Especially as this seems to be a problem with the View
>> Spec that we can't solve now. If there is a demand to add the lineage in
>> the future, once the catalog-alias problem has been solved, we can still
>> add it then.
>>
>> Let me know your thoughts.
>>
>> Best wishes,
>>
>> Jan
>> On 28.09.24 07:45, Benny Chow wrote:
>>
>> >>  storing the lineage is an optimization that can avoid
>> recomputation/re-parsing.
>> I don't think having the lineage is optimizing much over re-parsing the
>> SQL.  The most expensive part of SQL parsing is catalog access which has to
>> happen with lineage anyway.  Once the planner has the query tree, it can
>> validate the freshness.  It's not like the planner needs to complete
>> logical and physical planning.
>>
>> >> We could also have the catalog name/alias problem for the same engine.
>> Yes, this is a general problem with the Iceberg View spec.  I guess if
>> two different Spark clusters wanted to share the same view, they best not
>> reference the catalog name in their SQLs.  Even then, cross catalog joins
>> are not going to work.  Again, these are problems with the Iceberg View
>> spec.  I think for the MV spec, as long as we don't propose something that
>> involves SQL identifiers, then the MV spec isn't making this different
>> engine problem worse.
>>
>> There's another issue I'd like to bring up about using UUIDs which is
>> that these UUIDs are client generated and there's no validation that they
>> are indeed globally unique identifiers.  The catalog just persists whatever
>> it is given without validating that the UUIDs are indeed UUIDs and unique
>> across the catalog.  (I know Nessie is not doing this validation).   We are
>> assuming this UUID is not only unique within a catalog but is also unique
>> across catalogs.  Thoughts on this?
>>
>> Thanks
>> Benny
>>
>>
>>
>> On Wed, Sep 25, 2024 at 8:01 PM Steven Wu <[email protected]> wrote:
>>
>>> I agree that it is reasonable to assume/restrict view definition and
>>> storage table in the same catalog. Hence the storage table reference in the
>>> view metadata can include only namespace and table (excluding the engine
>>> dependent catalog name/alias).
>>>
>>> Regarding the question of having lineage metadata in view definition vs
>>> re-parsing SQL, I guess storing the lineage is an optimization that can
>>> avoid recomputation/re-parsing. It would be good to have more input.
>>>
>>> Thinking about catalog name/alias again. For the same engine (like
>>> Spark), different applications/jobs may configure the catalog name
>>> differently. E.g. Spark catalogs are configured using properties under
>>> spark.sql.catalog.(catalog_name). We could also have the catalog
>>> name/alias problem for the same engine.
>>>
>>>
>>>
>>>
>>> On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> Hi Walaa,
>>>>
>>>> It appears that you would like to maintain the lineage structure and
>>>> not revert to parsing the SQL to obtain identifiers.
>>>>
>>>> Initially, one of the reasons for avoiding SQL parsing was to enable
>>>> consumers who don't understand the SQL dialect of any representation to
>>>> determine the freshness of the Materialized View (MV). However, with the
>>>> "catalog alias" issue, having an identifier for some representation is
>>>> insufficient, as the *catalog_name* is unlikely to work for the
>>>> consumer. Therefore, supporting consumers that don't use a query engine of
>>>> any representation seems impossible.
>>>>
>>>> Given this, parsing the SQL definition becomes a less significant
>>>> drawback, as the consumer must understand the dialect anyway. In fact,
>>>> simply parsing the SQL definition seems like a more robust and
>>>> straightforward solution than using a lineage for every representation. I
>>>> believe this is why Benny suggested reverting to SQL parsing, and I agree
>>>> with him.
>>>>
>>>> Regarding the Storage table identifier: Its design as a
>>>> *PartialIdentifier* with only namespace and name fields was
>>>> intentional, to avoid the *catalog_name* issue.
>>>>
>>>> Best regards,
>>>>
>>>> Jan
>>>> On 19.09.24 23:16, Benny Chow wrote:
>>>>
>>>> If Spark added the storage table identifier to the MV, I'm not sure how
>>>> it could also add a full identifier to the Dremio representation.
>>>> Spark doesn't know what name Dremio used for the catalog.
>>>>
>>>> For the UX issue, I think Jan cleverly called it a "PartialIdentifier"
>>>> and not a "FullIdentifier" to indicate that catalog name is not even a
>>>> property of the identifier.
>>>>
>>>> Requirement 3 is for the view's SQL.  I'm not sure there is a very
>>>> strong use case to put the storage table into a different catalog than the
>>>> view.  If we had an engine agnostic solution for it, I'm all for it
>>>> though...
>>>>
>>>> Thanks
>>>> Benny
>>>>
>>>>
>>>> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <
>>>> [email protected]> wrote:
>>>>
>>>>> I think the solution for the storage identifier might be shared with
>>>>> the end state solution for the lineage. One could imagine a "full
>>>>> identifier" can be used for the storage table; however, it is
>>>>> "representation"-dependent (i.e., it changes according to
>>>>> which representation it is part of, or rather which engine uses it).
>>>>>
>>>>> Also, are we asking engines (or their Iceberg implementation) to throw
>>>>> an exception if the full storage table identifier was provided as part of
>>>>> the MV definition? Sounds like a not very ideal UX. Note that it also
>>>>> conflicts with the spirit of requirement #3.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <[email protected]> wrote:
>>>>>
>>>>>> Hi Jan
>>>>>>
>>>>>> "PartialIdentifier" without the catalog name sounds good to me.  The
>>>>>> storage table and MV have to be in the same catalog.  That would be a
>>>>>> good fifth requirement to add to the list.
>>>>>>
>>>>>> Thanks
>>>>>> Benny
>>>>>>
>>>>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul
>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>>> Cool, I guess it's easier to resolve these kinds of things when
>>>>>>> talking in person.
>>>>>>>
>>>>>>> I agree with your requirements and the conclusion to use a map from
>>>>>>> UUID to snapshot-id/version-id as the refresh-state, as well as dropping
>>>>>>> the lineage in favor of just re-parsing the SQL query. This gets us
>>>>>>> around the "catalog alias" issue.
>>>>>>>
>>>>>>> And I'm also OK with every engine requiring their own representation
>>>>>>> to use the MV.
>>>>>>>
>>>>>>> There is still the issue with the identifier of the storage table
>>>>>>> and its catalog_name. Should we use a "PartialIdentifier" with a
>>>>>>> namespace and a name field, like so:
>>>>>>>
>>>>>>> {
>>>>>>>     "namespace": ["bronze"],
>>>>>>>     "name": "lineitem"
>>>>>>> }
>>>>>>>
>>>>>>> And require the storage table to be in the same catalog as the MV
>>>>>>> itself?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jan
>>>>>>> On 19.09.24 00:50, Benny Chow wrote:
>>>>>>>
>>>>>>> Steven and I met up yesterday at the Seattle Iceberg meetup and we
>>>>>>> got to talking about the "catalog alias" issue.  He described it as an
>>>>>>> annoying problem =p
>>>>>>>
>>>>>>> I think there are some key requirements we need to support:
>>>>>>>
>>>>>>> 1. Different engines can produce and consume shared MVs with
>>>>>>> freshness validation.
>>>>>>> 2. We cannot force different engines to standardize on the alias
>>>>>>> they use for the catalog.
>>>>>>> 3. We cannot force different SQL representations to exclude catalog
>>>>>>> names from table identifiers or not use fully qualified table names.
>>>>>>> 4. MV SQL can join tables and views from multiple catalogs ->
>>>>>>> Inevitable with Nessie, Polaris, Unity, Tabular and others...
>>>>>>>
>>>>>>> The producing engine has to save refresh state information to let
>>>>>>> the consuming engine know the snapshot of table X at the time of
>>>>>>> materialization.  The only way to identify this table across different
>>>>>>> catalog names is to use the cross catalog, globally unique UUID.  I
>>>>>>> think our only option is to have the refresh state map UUIDs to snapshot
>>>>>>> ids and view version ids.
>>>>>>>
>>>>>>> Assuming the above is how we store the refresh state, how does the
>>>>>>> consuming engine determine the current snapshot ids?  The consuming
>>>>>>> engine will have to fully expand the query tree, at which point it will
>>>>>>> have the UUIDs as well as the latest snapshot ids/view versions.  This
>>>>>>> can then be diffed against the materialization refresh state to
>>>>>>> determine freshness.  There isn't a need to store the view lineage
>>>>>>> information to map from UUID to the consumer specific identifier so that
>>>>>>> the consumer can then call back into the catalog with that identifier to
>>>>>>> get the latest state.  The consuming engine might as well just re-parse
>>>>>>> the SQL and expand the query.
>>>>>>>
>>>>>>> Personally, I'm OK with requiring that an engine must have its own
>>>>>>> SQL representation in order to use the MV.  To me, being able to fulfill
>>>>>>> the key requirements above is much more important.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Benny
>>>>>>>
>>>>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul
>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>
>>>>>>>> How about we make the *catalog_name field* of the identifier
>>>>>>>> optional? If the field is missing, it references a table/view in the
>>>>>>>> same catalog. If it is present, it has to be an engine agnostic catalog
>>>>>>>> name. Shouldn't the catalog_names from the REST catalog spec be engine
>>>>>>>> agnostic?
>>>>>>>>
>>>>>>>> I was wondering, is there no way to prescribe a catalog_name in
>>>>>>>> Spark or Dremio? What do you do if you include two Nessie catalogs?
>>>>>>>> They can't both be called LocalNessie.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jan
>>>>>>>> On 14.09.24 01:23, Benny Chow wrote:
>>>>>>>>
>>>>>>>> The main reason for putting the lineage into the view is so that
>>>>>>>> "another" engine can enumerate the tables in the view without
>>>>>>>> needing to parse any SQL.  But, if we put the lineage under the SQL
>>>>>>>> representation with engine specific catalog names, the "other" engine is
>>>>>>>> not going to be able to use those identifiers to look up the tables.  The
>>>>>>>> "other" engine can only look up those identifiers using its engine
>>>>>>>> specific catalog name.  It may be possible to enumerate the tables at the
>>>>>>>> view version level ONLY if those identifiers don't include the catalog
>>>>>>>> name.  However, if you have a view with a cross catalog join, then the
>>>>>>>> tables coming from the other catalog have to be fully qualified.  But
>>>>>>>> then the problem is that each engine will also alias the other catalog
>>>>>>>> differently.
>>>>>>>>
>>>>>>>> So, I think to summarize *multi-engine* view interoperability:
>>>>>>>>
>>>>>>>>    - default-catalog can't be specified
>>>>>>>>    - default-namespace can be specified
>>>>>>>>    - View SQL can only reference tables/views from the same
>>>>>>>>    catalog
>>>>>>>>
>>>>>>>> I think these are reasonable constraints for multi-engine use
>>>>>>>> cases.  If reasonable, for MVs, then the storage table, refresh-state
>>>>>>>> and lineage (at the view version level), could all be based on *engine
>>>>>>>> agnostic* identifiers without the catalog name.  The MV and
>>>>>>>> storage table would have to be in the same catalog.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Benny
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul
>>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Regarding our recent discussion on table identifiers with respect
>>>>>>>>> to different catalog_names across query engines: we have the same
>>>>>>>>> problem when we want to reference the storage table from the common
>>>>>>>>> view. *If we include the catalog_name as part of the identifier,
>>>>>>>>> different query engines might not be able to load the storage table.*
>>>>>>>>> We could enforce that every storage table has to be part of the
>>>>>>>>> same catalog as the main view. This way an identifier without the
>>>>>>>>> catalog_name would be enough to point to the correct storage table.
>>>>>>>>>
>>>>>>>>> What are your thoughts on this?
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>>>>>>>>>
>>>>>>>>> I think this type of discussion is exactly what motivates a
>>>>>>>>> clarification in the view spec so that we can resolve MV lineage. I will
>>>>>>>>> create a separate thread for view spec clarification.
>>>>>>>>>
>>>>>>>>> Following up on Jan’s point, yes, I agree that in order to support
>>>>>>>>> catalog name, it should be at the representation level, but catalog
>>>>>>>>> name does not really depend on the “dialect” but rather on the “engine”;
>>>>>>>>> hence the discussion becomes a little more involved.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
>>>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Benny,
>>>>>>>>>>
>>>>>>>>>> I think that identifiers only being defined for a certain
>>>>>>>>>> representation is exactly what we want. Each representation can
>>>>>>>>>> define its own identifiers that then map to a UUID. This way the
>>>>>>>>>> "catalog_name" of the identifier for a "Spark" dialect can be different
>>>>>>>>>> than for a "Dremio" dialect.
>>>>>>>>>>
>>>>>>>>>> The important part is that we still have a list of identifiers
>>>>>>>>>> for each representation that we can use with the catalog to obtain
>>>>>>>>>> the state of the source tables.
>>>>>>>>>>
>>>>>>>>>> Best wishes,
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>> On 11.09.24 01:33, Benny Chow wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Walaa, I don't think the current view spec implicitly assumes
>>>>>>>>>> a common catalog name between engines.  I tested this by not
>>>>>>>>>> specifying the default-catalog and both engines could look up the
>>>>>>>>>> correct table under the shared default-namespace even when each engine
>>>>>>>>>> uses a different catalog name.
>>>>>>>>>>
>>>>>>>>>> Hi Jan, I think the issue with putting the lineage as part of the
>>>>>>>>>> representation is that that identifier only makes sense for that
>>>>>>>>>> representation's engine.  In your example, the catalog aliased as
>>>>>>>>>> "iceberg" in Spark is going to have a different name in Dremio or Trino.
>>>>>>>>>>
>>>>>>>>>> IMO, if we are to store a lineage for a view, it should consist
>>>>>>>>>> of something engine agnostic like the table/view UUIDs.  This would
>>>>>>>>>> be stored at the view version level and not the representation level.
>>>>>>>>>> I think as we get into more of these multi-engine, multi-catalog use
>>>>>>>>>> cases for views, the Iceberg Catalog is going to need to do a better job
>>>>>>>>>> at handling CRUD by UUID instead of engine specific identifiers.  Another
>>>>>>>>>> scenario we need to think through is a view that joins tables from two
>>>>>>>>>> different catalogs.  How would we represent the lineage for that in an
>>>>>>>>>> engine agnostic way?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Benny
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
>>>>>>>>>> <[email protected]> <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I
>>>>>>>>>>> have a better understanding now. Sorry for being a bit stubborn
>>>>>>>>>>> before.
>>>>>>>>>>>
>>>>>>>>>>> Wouldn't it make sense then to store the lineage as part of the
>>>>>>>>>>> representation:
>>>>>>>>>>>
>>>>>>>>>>> {
>>>>>>>>>>>     "type": "sql",
>>>>>>>>>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",
>>>>>>>>>>>     "dialect": "spark",
>>>>>>>>>>>     "lineage": [{
>>>>>>>>>>>         "identifier": { "catalog": "iceberg", "namespace": "public", "table": "events" },
>>>>>>>>>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>>>>>>>>>     }]
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Best wishes,
>>>>>>>>>>>
>>>>>>>>>>> Jan
>>>>>>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>
>>>>>>>>>>> Benny, thank you so much for performing the experiment. Glad
>>>>>>>>>>> that using UUIDs as keys in the state map makes more sense now.
>>>>>>>>>>>
>>>>>>>>>>> For the issue with the view spec being restrictive, I agree and
>>>>>>>>>>> I raised this concern on the view spec PR last year [1]. I
>>>>>>>>>>> think there is some area of improvement here. At the least, if it is
>>>>>>>>>>> restrictive, it should be explicitly stated. I will start a thread on
>>>>>>>>>>> how to approach the view spec. We may need to get more insight on the
>>>>>>>>>>> view spec before finalizing the MV spec, because the view spec will
>>>>>>>>>>> determine if we should proceed with one lineage (with the implicitly
>>>>>>>>>>> assumed common catalog name), or with multiple lineages (one per engine
>>>>>>>>>>> or catalog name).
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Walaa.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>>
>>>>>>>>>>>> I did some testing with two different engines (Spark and
>>>>>>>>>>>> Dremio) against the same Nessie catalog and created the attached
>>>>>>>>>>>> materialized view metadata.json.  I see your point now about the
>>>>>>>>>>>> SQL identifiers being tightly coupled to the engines.  In the metadata
>>>>>>>>>>>> JSON, Spark refers to the catalog as "SparkNessie", whereas Dremio
>>>>>>>>>>>> refers to the catalog as "LocalNessie".  So, this means that the fully
>>>>>>>>>>>> qualified view and table identifiers are engine specific and Dremio
>>>>>>>>>>>> can't look up a Spark identifier and vice versa.
>>>>>>>>>>>>
>>>>>>>>>>>> *So, I think it does make sense now for the refresh-state to
>>>>>>>>>>>> key off the UUIDs and not use engine specific identifiers.*  This
>>>>>>>>>>>> also means that the materialization consumer will have to fully
>>>>>>>>>>>> expand the query tree and basically diff the UUID + latest snapshot ids
>>>>>>>>>>>> against the refresh state.  Would it ever make sense for the Iceberg
>>>>>>>>>>>> Catalog to expose a bulk lookup API by UUID?
>>>>>>>>>>>>
>>>>>>>>>>>> As a side note, it seems that for a materialized view to work
>>>>>>>>>>>> with multiple engines, the default-catalog and default-namespace
>>>>>>>>>>>> can't be used unless both engines use the same catalog name, which
>>>>>>>>>>>> seems pretty restrictive to me.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the great discussions
>>>>>>>>>>>> Benny
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>>>>>>>>>> representations in Approach 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The takeaway is that SQL identifiers are highly coupled with
>>>>>>>>>>>>> engines, just like views. It makes sense to track both together
>>>>>>>>>>>>> for consistency.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
>>>>>>>>>>>>> <[email protected]> <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Walaa, thank you for bringing up this use case. I think we
>>>>>>>>>>>>>> need to keep in mind that we require identifiers to interface
>>>>>>>>>>>>>> with the catalog. We cannot use UUIDs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for
>>>>>>>>>>>>>> your use case because you can't store the catalog names of
>>>>>>>>>>>>>> multiple representations in the lineage. You would need to fall back to
>>>>>>>>>>>>>> parsing the SQL for a particular representation and rebuilding the full
>>>>>>>>>>>>>> query tree to obtain the identifiers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You could do the same for Approach 2. So I don't see why
>>>>>>>>>>>>>> Approach 1 would yield any benefits.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Benny, `default-catalog` is optional, while
>>>>>>>>>>>>>> `default-namespace` is required.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I will retract my comment on the `summary`. It indicates the
>>>>>>>>>>>>>> engine that made the revision to the current view version. It
>>>>>>>>>>>>>> doesn't really matter for multi-engine/representation support.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope
>>>>>>>>>>>>>>> it wouldn't have to be under a specific representation.
>>>>>>>>>>>>>>> Walaa - That's a serious concern...  If the same catalog is
>>>>>>>>>>>>>>> aliased differently by two different engines, then the basic
>>>>>>>>>>>>>>> view spec seems broken to me since "default-namespace" includes the
>>>>>>>>>>>>>>> catalog alias and is outside of the SQL representation.  Does that mean
>>>>>>>>>>>>>>> for a view to be interoperable, we require different engines to use the
>>>>>>>>>>>>>>> same catalog name?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of
>>>>>>>>>>>>>>>> multiple representations (for different engines), which
>>>>>>>>>>>>>>>> definitely requires more discussion from the community.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I am looking at the view spec, I am seeing some
>>>>>>>>>>>>>>>> conflict. The "summary" field seems meant for only one engine,
>>>>>>>>>>>>>>>> while "representations" supports multiple engines.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> "summary" : {
>>>>>>>>>>>>>>>>   "engine-name" : "Spark",
>>>>>>>>>>>>>>>>   "engineVersion" : "3.3.2"
>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>> "representations" : [ {
>>>>>>>>>>>>>>>>   "type" : "sql",
>>>>>>>>>>>>>>>>   "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",
>>>>>>>>>>>>>>>>   "dialect" : "spark"
>>>>>>>>>>>>>>>> } ]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With multiple representations/engines, I guess one engine
>>>>>>>>>>>>>>>> will be responsible for the storage table refresh and other
>>>>>>>>>>>>>>>> engines are read only. If we want to store the lineage info in the
>>>>>>>>>>>>>>>> view, it probably needs to be part of the "representation" struct so
>>>>>>>>>>>>>>>> that each engine/representation stores its own lineage info.
>>>>>>>>>>>>>>>> Who is to validate/ensure that the SQL representations are
>>>>>>>>>>>>>>>> actually semantically identical (minus syntax differences
>>>>>>>>>>>>>>>> across engines)? I guess this responsibility is left to the user who
>>>>>>>>>>>>>>>> owns and manages the view.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
