Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Tue, 01 Oct 2024 20:02:58 -0700

Hi Jan

Both sound good to me.  (No lineage in views and assumption about UUIDs
being unique across catalogs).  I hope we get to voting soon on your PR..


Thanks
Benny

On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul <[email protected]>
wrote:

> Hi Benny,
>
> thanks for bringing up the UUID issue. It is my understanding that UUIDs
> are designed to be used in distributed systems without the need for a
> central coordination process. This is in line with the description from the 
> RFC4122
> document <https://datatracker.ietf.org/doc/html/rfc4122> where it says:
> "One of the main reasons for using UUIDs is that no centralized authority
> is required to administer them".
>
> Wikipedia <https://en.wikipedia.org/wiki/Universally_unique_identifier>
> says:
>
> "While the probability that a UUID will be duplicated is not zero, it is
> generally considered close enough to zero to be negligible"
>
> Consequently, I would argue that it is a reasonable assumption that UUIDs
> are unique across catalogs.
>
>
> Regarding the identifier/catalog-alias problem: As we can fallback to SQL
> parsing and don't require the lineage, I would propose to move ahead
> without the lineage. Especially as this seems to be a problem with the View
> Spec that we can't solve now. If there is a demand to add the lineage in
> the future, once the catalog-alias problem has been solved, we can still
> add it then.
>
> Let me know your thoughts.
>
> Best wishes,
>
> Jan
> On 28.09.24 07:45, Benny Chow wrote:
>
> >>  storing the lineage is an optimization that can avoid
> recomputation/re-parsing.
> I don't think having the lineage is optimizing much over re-parsing the
> SQL.  The most expensive part of SQL parsing is catalog access which has to
> happen with lineage anyway.  Once the planner has the query tree, it can
> validate the freshness.  It's not like the planner needs to complete
> logical and physical planning.
>
> >> We could also have the catalog name/alias problem for the same engine.
> Yes, this is a general problem with the Iceberg View spec.  I guess if two
> different Spark clusters wanted to share the same view, they best not
> reference the catalog name in their SQLs.  Even then, cross catalog joins
> are not going to work.  Again, these are problems with the Iceberg View
> spec.  I think for the MV spec, as long as we don't propose something that
> involves SQL identifiers, then the MV spec isn't making this different
> engine problem worse.
>
> There's another issue I'd like to bring up about using UUIDs which is that
> these UUIDs are client generated and there's no validation that they are
> indeed globally unique identifiers.  The catalog just persists whatever it
> is given without validating that the UUIDs are indeed UUIDs and unique
> across the catalog.  (I know Nessie is not doing this validation).   We are
> assuming this UUID is not only unique within a catalog but is also unique
> across catalogs.  Thoughts on this?
>
> Thanks
> Benny
>
>
>
> On Wed, Sep 25, 2024 at 8:01 PM Steven Wu <[email protected]> wrote:
>
>> I agree that it is reasonable to assume/restrict view definition and
>> storage table in the same catalog. Hence the storage table reference in the
>> view metadata can include only namespace and table (excluding the engine
>> dependent catalog name/alias).
>>
>> Regarding the question of having lineage metadata in view definition vs
>> re-parsing SQL, I guess storing the lineage is an optimization that can
>> avoid recomputation/re-parsing. would be good to have more input.
>>
>> Thinking about catalog name/alias again. For the same engine (like
>> Spark), different applications/jobs may configure the catalog name
>> differently. E.g. Spark catalogs are configured using properties under
>> spark.sql.catalog.(catalog_name). We could also have the catalog
>> name/alias problem for the same engine.
>>
>>
>>
>>
>> On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul <[email protected]>
>> <[email protected]> wrote:
>>
>>> Hi Walaa,
>>>
>>> It appears that you would like to maintain the lineage structure and not
>>> revert to parsing the SQL to obtain identifiers.
>>>
>>> Initially, one of the reasons for avoiding SQL parsing was to enable
>>> consumers who don't understand the SQL dialect of any representation to
>>> determine the freshness of the Materialized View (MV). However, with the
>>> "catalog alias" issue, having an identifier for some representation is
>>> insufficient, as the *catalog_name* is unlikely to work for the
>>> consumer. Therefore, supporting consumers that don't use a query engine of
>>> any representation seems impossible.
>>>
>>> Given this, parsing the SQL definition becomes a less significant
>>> drawback, as the consumer must understand the dialect anyway. In fact,
>>> simply parsing the SQL definition seems like a more robust and
>>> straightforward solution than using a lineage for every representation. I
>>> believe this is why Benny suggested reverting to SQL parsing, and I agree
>>> with him.
>>>
>>> Regarding the Storage table identifier: Its design as a
>>> *PartialIdentifier* with only namespace and name fields was
>>> intentional, to avoid the *catalog_name* issue.
>>>
>>> Best regards,
>>>
>>> Jan
>>> On 19.09.24 23:16, Benny Chow wrote:
>>>
>>> If Spark added the storage table identifier to the MV, I'm not sure how
>>> it could also add a full identifier to the Dremio representation.
>>> Spark doesn't know what name Dremio used for the catalog.
>>>
>>> For the UX issue, I think Jan cleverly called it a "PartialIdentifier"
>>> and not a "FullIdentifier" to indicate that catalog name is not even a
>>> property of the identifier.
>>>
>>> Requirement 3 is for the view's SQL.  I'm not sure there is a very
>>> strong use case to put the storage table into a different catalog than the
>>> view.  If we had an engine agnostic solution for it, I'm all for it
>>> though...
>>>
>>> Thanks
>>> Benny
>>>
>>>
>>> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>>
>>>> I think the solution for the storage identifier might be shared with
>>>> the end state solution for the lineage. One could imagine a "full
>>>> identifier" can be used for the storage table; however, it is
>>>> "representation"-dependent (i.e., it changes according to
>>>> which representation it is part of, or rather which engine uses it).
>>>>
>>>> Also, are we asking engines (or their Iceberg implementation) to throw
>>>> an exception if the full storage table identifier was provided as part of
>>>> the MV definition? Sounds like a not very ideal UX. Note that it also
>>>> conflicts with the spirit of requirement #3.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <[email protected]> wrote:
>>>>
>>>>> Hi Jan
>>>>>
>>>>> "PartialIdentifier" without the catalog name sounds good to me.  The
>>>>> storage table and MV have to be in the same catalog.  That would be a good
>>>>> fifth requirement to add to the list.
>>>>>
>>>>> Thanks
>>>>> Benny
>>>>>
>>>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <[email protected]>
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Cool, I guess it's easier to resolve these kind of things when
>>>>>> talking in person.
>>>>>>
>>>>>> I agree with your requirements and the conclusion to use a map from
>>>>>> UUID to snapshot-id/version-id as the refresh-state, as well as dropping
>>>>>> the lineage in favor to just re-parsing the SQL query. This gets us 
>>>>>> around
>>>>>> the "catalog alias" issue.
>>>>>>
>>>>>> And I'm also OK with every engine requiring their own representation
>>>>>> to use the MV.
>>>>>>
>>>>>> There is still the issue with the identifier of the storage table and
>>>>>> its catalog_name. Should we use an "PartialIdentifier" with a namespace 
>>>>>> and
>>>>>> a name field, like so:
>>>>>>
>>>>>> {
>>>>>>
>>>>>>     namespace: ["bronze"],
>>>>>>
>>>>>>     name: "lineitem"
>>>>>>
>>>>>> }
>>>>>>
>>>>>> And require the storage table to be in the same catalog as the MV
>>>>>> itself?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jan
>>>>>> On 19.09.24 00:50, Benny Chow wrote:
>>>>>>
>>>>>> Steven and I met up yesterday at the Seattle Iceberg meetup and we
>>>>>> got to talking about the "catalog alias" issue.  He described it as an
>>>>>> annoying problem =p
>>>>>>
>>>>>> I think there are some key requirements we need to support:
>>>>>>
>>>>>> 1. Different engines can produce and consume shared MVs with
>>>>>> freshness validation.
>>>>>> 2. We cannot force different engines to standardize on the alias they
>>>>>> use for the catalog.
>>>>>> 3. We cannot force different SQL representations to exclude catalog
>>>>>> names from table identifiers or not use fully qualified table names.
>>>>>> 4. MV SQL can join tables and views from multiple catalogs ->
>>>>>> Inevitable with Nessie, Polaris, Unity, Tabular and others...
>>>>>>
>>>>>> The producing engine has to save refresh state information to let
>>>>>> consuming engine know that table X is at what snapshot at the time of
>>>>>> materialization.  The only way to identify this table across different
>>>>>> catalog names is to use the cross catalog, globally unique UUID.  I think
>>>>>> our only option is to have the refresh state map UUID to snapshot ids and
>>>>>> view version ids.
>>>>>>
>>>>>> Assuming the above is how we store the refresh state, how does the
>>>>>> consuming engine determine the current snapshot ids?  The consuming 
>>>>>> engine
>>>>>> will have to fully expand the query tree at which point it will have the
>>>>>> UUIDs as well as the latest snapshot ids/view versions.  This can then be
>>>>>> diffed against the materialization refresh state to determine freshness.
>>>>>> There isn't a need to store the view lineage information to map from UUID
>>>>>> to the consumer specific identifier so that the consumer can then call 
>>>>>> back
>>>>>> into the catalog with that identifier to get the latest state.  The
>>>>>> consuming engine might as well just re-parse the SQL and expand the 
>>>>>> query.
>>>>>>
>>>>>> Personally, I'm OK with requiring that an engine must have its own
>>>>>> SQL representation in order to use the MV.  To me, being able to fulfill
>>>>>> the key requirements above is much more important.
>>>>>>
>>>>>> Thanks
>>>>>> Benny
>>>>>>
>>>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul
>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>>> How about we make the *catalog_name field* of the identifier
>>>>>>> optional? If the field is missing, it references a table/view in the 
>>>>>>> same
>>>>>>> catalog. If it is present it has to be an engine agnostic catalog name.
>>>>>>> Shouldn't the catalog_names from the REST catalog spec be engine 
>>>>>>> agnostic?
>>>>>>>
>>>>>>> I was wondering, is there no way to prescribe a catalog_name in
>>>>>>> Spark or Dremio? What do you do if you include two Nessie catalogs? They
>>>>>>> can't both be called LocalNessie.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jan
>>>>>>> On 14.09.24 01:23, Benny Chow wrote:
>>>>>>>
>>>>>>> The main reason for putting the lineage into the view is so that
>>>>>>> "another" engine can enumerate out the tables in the view without 
>>>>>>> needing
>>>>>>> to parse any SQL.  But, if we put the lineage under the SQL 
>>>>>>> representation
>>>>>>> with engine specific catalog names, the "other" engine is not going to 
>>>>>>> be
>>>>>>> able to use those identifiers to look up the tables.  The "other" engine
>>>>>>> can only lookup those identifiers using its engine specific catalog 
>>>>>>> name.
>>>>>>> It may be possible to enumerate the tables at the view version level 
>>>>>>> ONLY
>>>>>>> if those identifiers don't include the catalog name.  However, if you 
>>>>>>> have
>>>>>>> a view with a cross catalog join, then the tables coming from the other
>>>>>>> catalog have to be fully qualified.  But then the problem is that each
>>>>>>> engine will also alias the other catalog differently too.
>>>>>>>
>>>>>>> So, I think to summarize *multi-engine* view interoperability:
>>>>>>>
>>>>>>>    - default-catalog can't be specified
>>>>>>>    - default-namespace can be specified
>>>>>>>    - View SQL can only references tables/views from the same catalog
>>>>>>>
>>>>>>> I think these are reasonable constraints for multi-engine use
>>>>>>> cases.  If reasonable, for MVs, then the storage table, refresh-state 
>>>>>>> and
>>>>>>> lineage (at the view version level), could all be based on *engine
>>>>>>> agnostic* identifiers without the catalog name.  The MV and storage
>>>>>>> table would have to be in the same catalog.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Benny
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul
>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> regarding our recent discussion on table identifiers with respect
>>>>>>>> to different catalog_names with different query engines. We have the 
>>>>>>>> same
>>>>>>>> problem when we want to reference the storage table from the common 
>>>>>>>> view.
>>>>>>>> *If we include the catalog_name as part of the identifier,
>>>>>>>> different query engines might not be able to load the storage table. *
>>>>>>>> We could enforce that every storage table has to be part of the
>>>>>>>> same catalog as the main view. This way an identifier without the
>>>>>>>> catalog_name would be enough to point to the correct storage table.
>>>>>>>>
>>>>>>>> What are your thoughts on this?
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>>
>>>>>>>> Jan
>>>>>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>>>>>>>>
>>>>>>>> I think this type of discussion is exactly what motivates a
>>>>>>>> clarification in the view spec so that we can resolve MV lineage. Will
>>>>>>>> create separate thread for view spec clarification.
>>>>>>>>
>>>>>>>> Following up on Jan’s point, yes I agree in order to support
>>>>>>>> catalog name, it should be at the representation level, but catalog 
>>>>>>>> name
>>>>>>>> does not really depend on the “dialect” but rather on the “engine”; 
>>>>>>>> hence
>>>>>>>> the discussion becomes a little more involved.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
>>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Benny,
>>>>>>>>>
>>>>>>>>> I think that identifiers only being defined for a certain
>>>>>>>>> representation is exactly what we want. Each representation can define
>>>>>>>>> their own identifiers that then map to an UUID. This way the 
>>>>>>>>> "catalog_name"
>>>>>>>>> of the identifier for a "Spark" dialect can be different then for a
>>>>>>>>> "Dremio" dialect.
>>>>>>>>>
>>>>>>>>> The important part is that we still have a list of identifiers for
>>>>>>>>> each representation that we can use with the catalog to obtain the 
>>>>>>>>> state of
>>>>>>>>> the source tables.
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>> On 11.09.24 01:33, Benny Chow wrote:
>>>>>>>>>
>>>>>>>>> Hi Walaa, I don't think the current view spec implicitly assumes a
>>>>>>>>> common catalog name between engines.  I tested this by not specifying 
>>>>>>>>> the
>>>>>>>>> default-catalog and both engines could look up the correct table 
>>>>>>>>> under the
>>>>>>>>> shared default-namespace even when each engine uses a different 
>>>>>>>>> catalog
>>>>>>>>> name.
>>>>>>>>>
>>>>>>>>> Hi Jan, I think the issue with putting the lineage as part of the
>>>>>>>>> representation is that that identifier only makes sense for that
>>>>>>>>> representation's engine.  In your example, the catalog aliased as 
>>>>>>>>> "iceberg"
>>>>>>>>> in spark is going to have a different name in Dremio or Trino.
>>>>>>>>>
>>>>>>>>> IMO, if we are to store a lineage for a view, it should consist of
>>>>>>>>> something engine agnostic like the table/view UUIDs.  This would be 
>>>>>>>>> stored
>>>>>>>>> at the view version level and not the representation level.  I think 
>>>>>>>>> as we
>>>>>>>>> get into more of these multi-engine, multi-catalog use cases for 
>>>>>>>>> views, the
>>>>>>>>> Iceberg Catalog is going to need to do a better job at handling CRUD 
>>>>>>>>> by
>>>>>>>>> UUID instead of engine specific identifiers.  Another scenario we 
>>>>>>>>> need to
>>>>>>>>> think through is a view that joins tables from two different 
>>>>>>>>> catalogs.  How
>>>>>>>>> would we represent the lineage for that in an engine agnostic way?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Benny
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
>>>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I have
>>>>>>>>>> a better understanding now. Sorry for being a bit stubborn before.
>>>>>>>>>>
>>>>>>>>>> Wouldn't it make sense then to store the lineage as part of the
>>>>>>>>>> representation:
>>>>>>>>>>
>>>>>>>>>> {
>>>>>>>>>>
>>>>>>>>>>     "type": "sql",
>>>>>>>>>>
>>>>>>>>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
>>>>>>>>>> events\nGROUP BY 2",
>>>>>>>>>>
>>>>>>>>>>     "dialect": "spark",
>>>>>>>>>>
>>>>>>>>>>     "lineage": [{
>>>>>>>>>>
>>>>>>>>>>         "identifier": { "catalog": "iceberg", "namespace":
>>>>>>>>>> "public", "table": "events"},
>>>>>>>>>>
>>>>>>>>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>>>>>>>>
>>>>>>>>>>     }]
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Best wishes,
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>>>>>>>>
>>>>>>>>>> Benny, thank you so much for performing the experiment. Glad that
>>>>>>>>>> using UUIDs as keys in the state map makes more sense now.
>>>>>>>>>>
>>>>>>>>>> For the issue with the view spec being restrictive, I agree and I
>>>>>>>>>> have raised the concern on the view spec PR last year [1]. I think 
>>>>>>>>>> there is
>>>>>>>>>> some area of improvement here. At the least, if it is restrictive, it
>>>>>>>>>> should be explicitly stated. I will start a thread on how to 
>>>>>>>>>> approach the
>>>>>>>>>> view spec. We may need to get more insight on the view spec before
>>>>>>>>>> finalizing the MV spec, because view spec will determine if we should
>>>>>>>>>> proceed with one lineage (with the implicitly assumed common catalog 
>>>>>>>>>> name),
>>>>>>>>>> or with multiple lineages (one per engine or catalog name).
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>
>>>>>>>>>>> I did some testing with two different engines (Spark and Dremio)
>>>>>>>>>>> against the same Nessie catalog and created the attached 
>>>>>>>>>>> materialized view
>>>>>>>>>>> metadata.json.  I see your point now about the SQL identifiers being
>>>>>>>>>>> tightly coupled to the engines.  In the metadata JSON, spark refers 
>>>>>>>>>>> to the
>>>>>>>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>>>>>>>>>> "LocalNessie".  So, this means that the fully qualified view and 
>>>>>>>>>>> table
>>>>>>>>>>> identifiers are engine specific and Dremio can't lookup a Spark 
>>>>>>>>>>> identifier
>>>>>>>>>>> and vice versa.
>>>>>>>>>>>
>>>>>>>>>>> *So, I think it does make sense now for the refresh-state to key
>>>>>>>>>>> off the UUIDs and not use engine specific identifiers.  *This
>>>>>>>>>>> also means that the materization consumer will have to fully expand 
>>>>>>>>>>> the
>>>>>>>>>>> query tree and basically diff the UUID + latest snapshot ids 
>>>>>>>>>>> against the
>>>>>>>>>>> refresh state.  Would it ever make sense for the Iceberg Catalog to 
>>>>>>>>>>> expose
>>>>>>>>>>> a bulk lookup API by UUID?
>>>>>>>>>>>
>>>>>>>>>>> As a side note, it seems that for a materialized view to work
>>>>>>>>>>> with multiple engines, the default-catalog and default-namespace 
>>>>>>>>>>> can't be
>>>>>>>>>>> used unless both engines use the same catalog name which seems 
>>>>>>>>>>> pretty
>>>>>>>>>>> restrictive to me.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the great discussions
>>>>>>>>>>> Benny
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>>>>>>>>> representations in Approach 1.
>>>>>>>>>>>>
>>>>>>>>>>>> The takeaway is that SQL identifiers are highly coupled with
>>>>>>>>>>>> engines, just like views. It makes sense to track both together for
>>>>>>>>>>>> consistency.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
>>>>>>>>>>>> <[email protected]> <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Walaa, thanks you for bringing up this use case. I think we
>>>>>>>>>>>>> need to keep in mind that we require identifiers to interface 
>>>>>>>>>>>>> with the
>>>>>>>>>>>>> catalog. We cannot use UUIDs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for
>>>>>>>>>>>>> your use case because you can't store the catalog names of 
>>>>>>>>>>>>> multiple
>>>>>>>>>>>>> representations in the lineage. You would need to fallback to 
>>>>>>>>>>>>> parsing the
>>>>>>>>>>>>> SQL for a particular representation and rebuilding the full query 
>>>>>>>>>>>>> tree to
>>>>>>>>>>>>> obtain the identifiers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You could do the same for Approach 2. So I don't see why
>>>>>>>>>>>>> Approach 1 would yield any benefits.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Benny, `default-catalog` is optional, while
>>>>>>>>>>>>> `default-namespace` is required.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will retract my comment on the `summary`. it indicates the
>>>>>>>>>>>>> engine that made the revision to the current view version. it 
>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>> really matter for multi-engine/representation support.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope
>>>>>>>>>>>>>> it wouldn't have to be under a specific representation.
>>>>>>>>>>>>>> Walaa - That's a serious concern...  If the same catalog is
>>>>>>>>>>>>>> aliased differently by two different engines, then the basic 
>>>>>>>>>>>>>> view spec
>>>>>>>>>>>>>> seems broken to me since "default-namespace" includes the 
>>>>>>>>>>>>>> catalog alias and
>>>>>>>>>>>>>> is outside of the SQL representation.  Does that mean for a view 
>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>> interoperable, we require different engines to use the same 
>>>>>>>>>>>>>> catalog name?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of
>>>>>>>>>>>>>>> multiple representations (for different engines), which 
>>>>>>>>>>>>>>> definitely requires
>>>>>>>>>>>>>>> more discussion from the community.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I am looking at the view spec, I am seeing some
>>>>>>>>>>>>>>> conflict. "summary" field seems meant for only one engine, while
>>>>>>>>>>>>>>> "representations" support multiple engines.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "summary" : {
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>>>>>>>>>> "engine-name" : "Spark",
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>>>>>>>>>> "engineVersion" : "3.3.2"
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18>
>>>>>>>>>>>>>>> }, <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>>>>>>>>>> "representations" : [ {
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20>
>>>>>>>>>>>>>>> "type" : "sql",
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21>
>>>>>>>>>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM 
>>>>>>>>>>>>>>> events\nGROUP BY
>>>>>>>>>>>>>>> 2",
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>>>>>>>>>> "dialect" : "spark"
>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> }
>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With multiple representations/engines, I guess one engine
>>>>>>>>>>>>>>> will be responsible for the storage table refresh and other 
>>>>>>>>>>>>>>> engines are
>>>>>>>>>>>>>>> read only. If we want to store the lineage info in the view, it 
>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>> needs to be part of the "representation" struct so that each
>>>>>>>>>>>>>>> engine/representation stores its own lineage info..
>>>>>>>>>>>>>>> Who is to validate/ensure that the SQL representation is
>>>>>>>>>>>>>>> actually semantically identical (minus syntax differences 
>>>>>>>>>>>>>>> across engines)?
>>>>>>>>>>>>>>> I guess this responsibility is left to the user who owns and 
>>>>>>>>>>>>>>> manages the
>>>>>>>>>>>>>>> view.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to