Re: [DISCUSS] Iceberg Materialzied Views

Steven Wu Wed, 25 Sep 2024 20:01:43 -0700

I agree that it is reasonable to assume/restrict view definition and
storage table in the same catalog. Hence the storage table reference in the
view metadata can include only namespace and table (excluding the engine
dependent catalog name/alias).


Regarding the question of having lineage metadata in view definition vs
re-parsing SQL, I guess storing the lineage is an optimization that can
avoid recomputation/re-parsing. would be good to have more input.

Thinking about catalog name/alias again. For the same engine (like Spark),
different applications/jobs may configure the catalog name differently.
E.g. Spark catalogs are configured using properties under
spark.sql.catalog.(catalog_name). We could also have the catalog name/alias
problem for the same engine.




On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul <[email protected]>
wrote:

> Hi Walaa,
>
> It appears that you would like to maintain the lineage structure and not
> revert to parsing the SQL to obtain identifiers.
>
> Initially, one of the reasons for avoiding SQL parsing was to enable
> consumers who don't understand the SQL dialect of any representation to
> determine the freshness of the Materialized View (MV). However, with the
> "catalog alias" issue, having an identifier for some representation is
> insufficient, as the *catalog_name* is unlikely to work for the consumer.
> Therefore, supporting consumers that don't use a query engine of any
> representation seems impossible.
>
> Given this, parsing the SQL definition becomes a less significant
> drawback, as the consumer must understand the dialect anyway. In fact,
> simply parsing the SQL definition seems like a more robust and
> straightforward solution than using a lineage for every representation. I
> believe this is why Benny suggested reverting to SQL parsing, and I agree
> with him.
>
> Regarding the Storage table identifier: Its design as a
> *PartialIdentifier* with only namespace and name fields was intentional,
> to avoid the *catalog_name* issue.
>
> Best regards,
>
> Jan
> On 19.09.24 23:16, Benny Chow wrote:
>
> If Spark added the storage table identifier to the MV, I'm not sure how it
> could also add a full identifier to the Dremio representation.
> Spark doesn't know what name Dremio used for the catalog.
>
> For the UX issue, I think Jan cleverly called it a "PartialIdentifier" and
> not a "FullIdentifier" to indicate that catalog name is not even a property
> of the identifier.
>
> Requirement 3 is for the view's SQL.  I'm not sure there is a very strong
> use case to put the storage table into a different catalog than the view.
> If we had an engine agnostic solution for it, I'm all for it though...
>
> Thanks
> Benny
>
>
> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
>
>> I think the solution for the storage identifier might be shared with the
>> end state solution for the lineage. One could imagine a "full identifier"
>> can be used for the storage table; however, it is
>> "representation"-dependent (i.e., it changes according to
>> which representation it is part of, or rather which engine uses it).
>>
>> Also, are we asking engines (or their Iceberg implementation) to throw an
>> exception if the full storage table identifier was provided as part of the
>> MV definition? Sounds like a not very ideal UX. Note that it also conflicts
>> with the spirit of requirement #3.
>>
>> Thanks,
>> Walaa.
>>
>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <[email protected]> wrote:
>>
>>> Hi Jan
>>>
>>> "PartialIdentifier" without the catalog name sounds good to me.  The
>>> storage table and MV have to be in the same catalog.  That would be a good
>>> fifth requirement to add to the list.
>>>
>>> Thanks
>>> Benny
>>>
>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> Cool, I guess it's easier to resolve these kind of things when talking
>>>> in person.
>>>>
>>>> I agree with your requirements and the conclusion to use a map from
>>>> UUID to snapshot-id/version-id as the refresh-state, as well as dropping
>>>> the lineage in favor to just re-parsing the SQL query. This gets us around
>>>> the "catalog alias" issue.
>>>>
>>>> And I'm also OK with every engine requiring their own representation to
>>>> use the MV.
>>>>
>>>> There is still the issue with the identifier of the storage table and
>>>> its catalog_name. Should we use an "PartialIdentifier" with a namespace and
>>>> a name field, like so:
>>>>
>>>> {
>>>>
>>>>     namespace: ["bronze"],
>>>>
>>>>     name: "lineitem"
>>>>
>>>> }
>>>>
>>>> And require the storage table to be in the same catalog as the MV
>>>> itself?
>>>>
>>>> Thanks,
>>>>
>>>> Jan
>>>> On 19.09.24 00:50, Benny Chow wrote:
>>>>
>>>> Steven and I met up yesterday at the Seattle Iceberg meetup and we got
>>>> to talking about the "catalog alias" issue.  He described it as an annoying
>>>> problem =p
>>>>
>>>> I think there are some key requirements we need to support:
>>>>
>>>> 1. Different engines can produce and consume shared MVs with freshness
>>>> validation.
>>>> 2. We cannot force different engines to standardize on the alias they
>>>> use for the catalog.
>>>> 3. We cannot force different SQL representations to exclude catalog
>>>> names from table identifiers or not use fully qualified table names.
>>>> 4. MV SQL can join tables and views from multiple catalogs ->
>>>> Inevitable with Nessie, Polaris, Unity, Tabular and others...
>>>>
>>>> The producing engine has to save refresh state information to let
>>>> consuming engine know that table X is at what snapshot at the time of
>>>> materialization.  The only way to identify this table across different
>>>> catalog names is to use the cross catalog, globally unique UUID.  I think
>>>> our only option is to have the refresh state map UUID to snapshot ids and
>>>> view version ids.
>>>>
>>>> Assuming the above is how we store the refresh state, how does the
>>>> consuming engine determine the current snapshot ids?  The consuming engine
>>>> will have to fully expand the query tree at which point it will have the
>>>> UUIDs as well as the latest snapshot ids/view versions.  This can then be
>>>> diffed against the materialization refresh state to determine freshness.
>>>> There isn't a need to store the view lineage information to map from UUID
>>>> to the consumer specific identifier so that the consumer can then call back
>>>> into the catalog with that identifier to get the latest state.  The
>>>> consuming engine might as well just re-parse the SQL and expand the query.
>>>>
>>>> Personally, I'm OK with requiring that an engine must have its own SQL
>>>> representation in order to use the MV.  To me, being able to fulfill the
>>>> key requirements above is much more important.
>>>>
>>>> Thanks
>>>> Benny
>>>>
>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <[email protected]>
>>>> <[email protected]> wrote:
>>>>
>>>>> How about we make the *catalog_name field* of the identifier
>>>>> optional? If the field is missing, it references a table/view in the same
>>>>> catalog. If it is present it has to be an engine agnostic catalog name.
>>>>> Shouldn't the catalog_names from the REST catalog spec be engine agnostic?
>>>>>
>>>>> I was wondering, is there no way to prescribe a catalog_name in Spark
>>>>> or Dremio? What do you do if you include two Nessie catalogs? They can't
>>>>> both be called LocalNessie.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jan
>>>>> On 14.09.24 01:23, Benny Chow wrote:
>>>>>
>>>>> The main reason for putting the lineage into the view is so that
>>>>> "another" engine can enumerate out the tables in the view without needing
>>>>> to parse any SQL.  But, if we put the lineage under the SQL representation
>>>>> with engine specific catalog names, the "other" engine is not going to be
>>>>> able to use those identifiers to look up the tables.  The "other" engine
>>>>> can only lookup those identifiers using its engine specific catalog name.
>>>>> It may be possible to enumerate the tables at the view version level ONLY
>>>>> if those identifiers don't include the catalog name.  However, if you have
>>>>> a view with a cross catalog join, then the tables coming from the other
>>>>> catalog have to be fully qualified.  But then the problem is that each
>>>>> engine will also alias the other catalog differently too.
>>>>>
>>>>> So, I think to summarize *multi-engine* view interoperability:
>>>>>
>>>>>    - default-catalog can't be specified
>>>>>    - default-namespace can be specified
>>>>>    - View SQL can only references tables/views from the same catalog
>>>>>
>>>>> I think these are reasonable constraints for multi-engine use cases.
>>>>> If reasonable, for MVs, then the storage table, refresh-state and lineage
>>>>> (at the view version level), could all be based on *engine agnostic*
>>>>> identifiers without the catalog name.  The MV and storage table would have
>>>>> to be in the same catalog.
>>>>>
>>>>> Thanks
>>>>> Benny
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <[email protected]>
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> regarding our recent discussion on table identifiers with respect to
>>>>>> different catalog_names with different query engines. We have the same
>>>>>> problem when we want to reference the storage table from the common view.
>>>>>> *If we include the catalog_name as part of the identifier, different
>>>>>> query engines might not be able to load the storage table. *
>>>>>> We could enforce that every storage table has to be part of the same
>>>>>> catalog as the main view. This way an identifier without the catalog_name
>>>>>> would be enough to point to the correct storage table.
>>>>>>
>>>>>> What are your thoughts on this?
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Jan
>>>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>>>>>>
>>>>>> I think this type of discussion is exactly what motivates a
>>>>>> clarification in the view spec so that we can resolve MV lineage. Will
>>>>>> create separate thread for view spec clarification.
>>>>>>
>>>>>> Following up on Jan’s point, yes I agree in order to support catalog
>>>>>> name, it should be at the representation level, but catalog name does not
>>>>>> really depend on the “dialect” but rather on the “engine”; hence the
>>>>>> discussion becomes a little more involved.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul
>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Benny,
>>>>>>>
>>>>>>> I think that identifiers only being defined for a certain
>>>>>>> representation is exactly what we want. Each representation can define
>>>>>>> their own identifiers that then map to an UUID. This way the 
>>>>>>> "catalog_name"
>>>>>>> of the identifier for a "Spark" dialect can be different then for a
>>>>>>> "Dremio" dialect.
>>>>>>>
>>>>>>> The important part is that we still have a list of identifiers for
>>>>>>> each representation that we can use with the catalog to obtain the 
>>>>>>> state of
>>>>>>> the source tables.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>>
>>>>>>> Jan
>>>>>>> On 11.09.24 01:33, Benny Chow wrote:
>>>>>>>
>>>>>>> Hi Walaa, I don't think the current view spec implicitly assumes a
>>>>>>> common catalog name between engines.  I tested this by not specifying 
>>>>>>> the
>>>>>>> default-catalog and both engines could look up the correct table under 
>>>>>>> the
>>>>>>> shared default-namespace even when each engine uses a different catalog
>>>>>>> name.
>>>>>>>
>>>>>>> Hi Jan, I think the issue with putting the lineage as part of the
>>>>>>> representation is that that identifier only makes sense for that
>>>>>>> representation's engine.  In your example, the catalog aliased as 
>>>>>>> "iceberg"
>>>>>>> in spark is going to have a different name in Dremio or Trino.
>>>>>>>
>>>>>>> IMO, if we are to store a lineage for a view, it should consist of
>>>>>>> something engine agnostic like the table/view UUIDs.  This would be 
>>>>>>> stored
>>>>>>> at the view version level and not the representation level.  I think as 
>>>>>>> we
>>>>>>> get into more of these multi-engine, multi-catalog use cases for views, 
>>>>>>> the
>>>>>>> Iceberg Catalog is going to need to do a better job at handling CRUD by
>>>>>>> UUID instead of engine specific identifiers.  Another scenario we need 
>>>>>>> to
>>>>>>> think through is a view that joins tables from two different catalogs.  
>>>>>>> How
>>>>>>> would we represent the lineage for that in an engine agnostic way?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Benny
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I have a
>>>>>>>> better understanding now. Sorry for being a bit stubborn before.
>>>>>>>>
>>>>>>>> Wouldn't it make sense then to store the lineage as part of the
>>>>>>>> representation:
>>>>>>>>
>>>>>>>> {
>>>>>>>>
>>>>>>>>     "type": "sql",
>>>>>>>>
>>>>>>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
>>>>>>>> events\nGROUP BY 2",
>>>>>>>>
>>>>>>>>     "dialect": "spark",
>>>>>>>>
>>>>>>>>     "lineage": [{
>>>>>>>>
>>>>>>>>         "identifier": { "catalog": "iceberg", "namespace":
>>>>>>>> "public", "table": "events"},
>>>>>>>>
>>>>>>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>>>>>>
>>>>>>>>     }]
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>>
>>>>>>>> Jan
>>>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>>>>>>
>>>>>>>> Benny, thank you so much for performing the experiment. Glad that
>>>>>>>> using UUIDs as keys in the state map makes more sense now.
>>>>>>>>
>>>>>>>> For the issue with the view spec being restrictive, I agree and I
>>>>>>>> have raised the concern on the view spec PR last year [1]. I think 
>>>>>>>> there is
>>>>>>>> some area of improvement here. At the least, if it is restrictive, it
>>>>>>>> should be explicitly stated. I will start a thread on how to approach 
>>>>>>>> the
>>>>>>>> view spec. We may need to get more insight on the view spec before
>>>>>>>> finalizing the MV spec, because view spec will determine if we should
>>>>>>>> proceed with one lineage (with the implicitly assumed common catalog 
>>>>>>>> name),
>>>>>>>> or with multiple lineages (one per engine or catalog name).
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Walaa
>>>>>>>>>
>>>>>>>>> I did some testing with two different engines (Spark and Dremio)
>>>>>>>>> against the same Nessie catalog and created the attached materialized 
>>>>>>>>> view
>>>>>>>>> metadata.json.  I see your point now about the SQL identifiers being
>>>>>>>>> tightly coupled to the engines.  In the metadata JSON, spark refers 
>>>>>>>>> to the
>>>>>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>>>>>>>> "LocalNessie".  So, this means that the fully qualified view and table
>>>>>>>>> identifiers are engine specific and Dremio can't lookup a Spark 
>>>>>>>>> identifier
>>>>>>>>> and vice versa.
>>>>>>>>>
>>>>>>>>> *So, I think it does make sense now for the refresh-state to key
>>>>>>>>> off the UUIDs and not use engine specific identifiers.  *This
>>>>>>>>> also means that the materization consumer will have to fully expand 
>>>>>>>>> the
>>>>>>>>> query tree and basically diff the UUID + latest snapshot ids against 
>>>>>>>>> the
>>>>>>>>> refresh state.  Would it ever make sense for the Iceberg Catalog to 
>>>>>>>>> expose
>>>>>>>>> a bulk lookup API by UUID?
>>>>>>>>>
>>>>>>>>> As a side note, it seems that for a materialized view to work with
>>>>>>>>> multiple engines, the default-catalog and default-namespace can't be 
>>>>>>>>> used
>>>>>>>>> unless both engines use the same catalog name which seems pretty
>>>>>>>>> restrictive to me.
>>>>>>>>>
>>>>>>>>> Thanks for the great discussions
>>>>>>>>> Benny
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>>>>>>> representations in Approach 1.
>>>>>>>>>>
>>>>>>>>>> The takeaway is that SQL identifiers are highly coupled with
>>>>>>>>>> engines, just like views. It makes sense to track both together for
>>>>>>>>>> consistency.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
>>>>>>>>>> <[email protected]> <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Walaa, thanks you for bringing up this use case. I think we need
>>>>>>>>>>> to keep in mind that we require identifiers to interface with the 
>>>>>>>>>>> catalog.
>>>>>>>>>>> We cannot use UUIDs.
>>>>>>>>>>>
>>>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for your
>>>>>>>>>>> use case because you can't store the catalog names of multiple
>>>>>>>>>>> representations in the lineage. You would need to fallback to 
>>>>>>>>>>> parsing the
>>>>>>>>>>> SQL for a particular representation and rebuilding the full query 
>>>>>>>>>>> tree to
>>>>>>>>>>> obtain the identifiers.
>>>>>>>>>>>
>>>>>>>>>>> You could do the same for Approach 2. So I don't see why
>>>>>>>>>>> Approach 1 would yield any benefits.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jan
>>>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>>>>>>
>>>>>>>>>>> Benny, `default-catalog` is optional, while `default-namespace`
>>>>>>>>>>> is required.
>>>>>>>>>>>
>>>>>>>>>>> I will retract my comment on the `summary`. it indicates the
>>>>>>>>>>> engine that made the revision to the current view version. it 
>>>>>>>>>>> doesn't
>>>>>>>>>>> really matter for multi-engine/representation support.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>>>>>>>>> wouldn't have to be under a specific representation.
>>>>>>>>>>>> Walaa - That's a serious concern...  If the same catalog is
>>>>>>>>>>>> aliased differently by two different engines, then the basic view 
>>>>>>>>>>>> spec
>>>>>>>>>>>> seems broken to me since "default-namespace" includes the catalog 
>>>>>>>>>>>> alias and
>>>>>>>>>>>> is outside of the SQL representation.  Does that mean for a view 
>>>>>>>>>>>> to be
>>>>>>>>>>>> interoperable, we require different engines to use the same 
>>>>>>>>>>>> catalog name?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>>>>>>>>> representations (for different engines), which definitely 
>>>>>>>>>>>>> requires more
>>>>>>>>>>>>> discussion from the community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>>>>>>>>> "summary" field seems meant for only one engine, while 
>>>>>>>>>>>>> "representations"
>>>>>>>>>>>>> support multiple engines.
>>>>>>>>>>>>>
>>>>>>>>>>>>> "summary" : {
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>>>>>>>> "engine-name" : "Spark",
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>>>>>>>> "engineVersion" : "3.3.2"
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>>>>>>>> "representations" : [ {
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20>
>>>>>>>>>>>>> "type" : "sql",
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21>
>>>>>>>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM 
>>>>>>>>>>>>> events\nGROUP BY
>>>>>>>>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>>>>>>>> "dialect" : "spark"
>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>>>>>>>>
>>>>>>>>>>>>> With multiple representations/engines, I guess one engine will
>>>>>>>>>>>>> be responsible for the storage table refresh and other engines 
>>>>>>>>>>>>> are read
>>>>>>>>>>>>> only. If we want to store the lineage info in the view, it 
>>>>>>>>>>>>> probably
>>>>>>>>>>>>> needs to be part of the "representation" struct so that each
>>>>>>>>>>>>> engine/representation stores its own lineage info..
>>>>>>>>>>>>> Who is to validate/ensure that the SQL representation is
>>>>>>>>>>>>> actually semantically identical (minus syntax differences across 
>>>>>>>>>>>>> engines)?
>>>>>>>>>>>>> I guess this responsibility is left to the user who owns and 
>>>>>>>>>>>>> manages the
>>>>>>>>>>>>> view.
>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to