Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Thu, 19 Sep 2024 14:17:03 -0700

If Spark added the storage table identifier to the MV, I'm not sure how it
could also add a full identifier to the Dremio representation.
Spark doesn't know what name Dremio used for the catalog.


For the UX issue, I think Jan cleverly called it a "PartialIdentifier" and
not a "FullIdentifier" to indicate that catalog name is not even a property
of the identifier.

Requirement 3 is for the view's SQL.  I'm not sure there is a very strong
use case to put the storage table into a different catalog than the view.
If we had an engine agnostic solution for it, I'm all for it though...

Thanks
Benny


On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <[email protected]>
wrote:

> I think the solution for the storage identifier might be shared with the
> end state solution for the lineage. One could imagine a "full identifier"
> can be used for the storage table; however, it is
> "representation"-dependent (i.e., it changes according to
> which representation it is part of, or rather which engine uses it).
>
> Also, are we asking engines (or their Iceberg implementation) to throw an
> exception if the full storage table identifier was provided as part of the
> MV definition? Sounds like a not very ideal UX. Note that it also conflicts
> with the spirit of requirement #3.
>
> Thanks,
> Walaa.
>
> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <[email protected]> wrote:
>
>> Hi Jan
>>
>> "PartialIdentifier" without the catalog name sounds good to me.  The
>> storage table and MV have to be in the same catalog.  That would be a good
>> fifth requirement to add to the list.
>>
>> Thanks
>> Benny
>>
>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <[email protected]>
>> wrote:
>>
>>> Cool, I guess it's easier to resolve these kind of things when talking
>>> in person.
>>>
>>> I agree with your requirements and the conclusion to use a map from UUID
>>> to snapshot-id/version-id as the refresh-state, as well as dropping the
>>> lineage in favor to just re-parsing the SQL query. This gets us around the
>>> "catalog alias" issue.
>>>
>>> And I'm also OK with every engine requiring their own representation to
>>> use the MV.
>>>
>>> There is still the issue with the identifier of the storage table and
>>> its catalog_name. Should we use an "PartialIdentifier" with a namespace and
>>> a name field, like so:
>>>
>>> {
>>>
>>>     namespace: ["bronze"],
>>>
>>>     name: "lineitem"
>>>
>>> }
>>>
>>> And require the storage table to be in the same catalog as the MV itself?
>>>
>>> Thanks,
>>>
>>> Jan
>>> On 19.09.24 00:50, Benny Chow wrote:
>>>
>>> Steven and I met up yesterday at the Seattle Iceberg meetup and we got
>>> to talking about the "catalog alias" issue.  He described it as an annoying
>>> problem =p
>>>
>>> I think there are some key requirements we need to support:
>>>
>>> 1. Different engines can produce and consume shared MVs with freshness
>>> validation.
>>> 2. We cannot force different engines to standardize on the alias they
>>> use for the catalog.
>>> 3. We cannot force different SQL representations to exclude catalog
>>> names from table identifiers or not use fully qualified table names.
>>> 4. MV SQL can join tables and views from multiple catalogs -> Inevitable
>>> with Nessie, Polaris, Unity, Tabular and others...
>>>
>>> The producing engine has to save refresh state information to let
>>> consuming engine know that table X is at what snapshot at the time of
>>> materialization.  The only way to identify this table across different
>>> catalog names is to use the cross catalog, globally unique UUID.  I think
>>> our only option is to have the refresh state map UUID to snapshot ids and
>>> view version ids.
>>>
>>> Assuming the above is how we store the refresh state, how does the
>>> consuming engine determine the current snapshot ids?  The consuming engine
>>> will have to fully expand the query tree at which point it will have the
>>> UUIDs as well as the latest snapshot ids/view versions.  This can then be
>>> diffed against the materialization refresh state to determine freshness.
>>> There isn't a need to store the view lineage information to map from UUID
>>> to the consumer specific identifier so that the consumer can then call back
>>> into the catalog with that identifier to get the latest state.  The
>>> consuming engine might as well just re-parse the SQL and expand the query.
>>>
>>> Personally, I'm OK with requiring that an engine must have its own SQL
>>> representation in order to use the MV.  To me, being able to fulfill the
>>> key requirements above is much more important.
>>>
>>> Thanks
>>> Benny
>>>
>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> How about we make the *catalog_name field* of the identifier optional?
>>>> If the field is missing, it references a table/view in the same catalog. If
>>>> it is present it has to be an engine agnostic catalog name. Shouldn't the
>>>> catalog_names from the REST catalog spec be engine agnostic?
>>>>
>>>> I was wondering, is there no way to prescribe a catalog_name in Spark
>>>> or Dremio? What do you do if you include two Nessie catalogs? They can't
>>>> both be called LocalNessie.
>>>>
>>>> Thanks,
>>>>
>>>> Jan
>>>> On 14.09.24 01:23, Benny Chow wrote:
>>>>
>>>> The main reason for putting the lineage into the view is so that
>>>> "another" engine can enumerate out the tables in the view without needing
>>>> to parse any SQL.  But, if we put the lineage under the SQL representation
>>>> with engine specific catalog names, the "other" engine is not going to be
>>>> able to use those identifiers to look up the tables.  The "other" engine
>>>> can only lookup those identifiers using its engine specific catalog name.
>>>> It may be possible to enumerate the tables at the view version level ONLY
>>>> if those identifiers don't include the catalog name.  However, if you have
>>>> a view with a cross catalog join, then the tables coming from the other
>>>> catalog have to be fully qualified.  But then the problem is that each
>>>> engine will also alias the other catalog differently too.
>>>>
>>>> So, I think to summarize *multi-engine* view interoperability:
>>>>
>>>>    - default-catalog can't be specified
>>>>    - default-namespace can be specified
>>>>    - View SQL can only references tables/views from the same catalog
>>>>
>>>> I think these are reasonable constraints for multi-engine use cases.
>>>> If reasonable, for MVs, then the storage table, refresh-state and lineage
>>>> (at the view version level), could all be based on *engine agnostic*
>>>> identifiers without the catalog name.  The MV and storage table would have
>>>> to be in the same catalog.
>>>>
>>>> Thanks
>>>> Benny
>>>>
>>>>
>>>>
>>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <[email protected]>
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> regarding our recent discussion on table identifiers with respect to
>>>>> different catalog_names with different query engines. We have the same
>>>>> problem when we want to reference the storage table from the common view.
>>>>> *If we include the catalog_name as part of the identifier, different
>>>>> query engines might not be able to load the storage table. *
>>>>> We could enforce that every storage table has to be part of the same
>>>>> catalog as the main view. This way an identifier without the catalog_name
>>>>> would be enough to point to the correct storage table.
>>>>>
>>>>> What are your thoughts on this?
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Jan
>>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>>>>>
>>>>> I think this type of discussion is exactly what motivates a
>>>>> clarification in the view spec so that we can resolve MV lineage. Will
>>>>> create separate thread for view spec clarification.
>>>>>
>>>>> Following up on Jan’s point, yes I agree in order to support catalog
>>>>> name, it should be at the representation level, but catalog name does not
>>>>> really depend on the “dialect” but rather on the “engine”; hence the
>>>>> discussion becomes a little more involved.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <[email protected]>
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Benny,
>>>>>>
>>>>>> I think that identifiers only being defined for a certain
>>>>>> representation is exactly what we want. Each representation can define
>>>>>> their own identifiers that then map to an UUID. This way the 
>>>>>> "catalog_name"
>>>>>> of the identifier for a "Spark" dialect can be different then for a
>>>>>> "Dremio" dialect.
>>>>>>
>>>>>> The important part is that we still have a list of identifiers for
>>>>>> each representation that we can use with the catalog to obtain the state 
>>>>>> of
>>>>>> the source tables.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Jan
>>>>>> On 11.09.24 01:33, Benny Chow wrote:
>>>>>>
>>>>>> Hi Walaa, I don't think the current view spec implicitly assumes a
>>>>>> common catalog name between engines.  I tested this by not specifying the
>>>>>> default-catalog and both engines could look up the correct table under 
>>>>>> the
>>>>>> shared default-namespace even when each engine uses a different catalog
>>>>>> name.
>>>>>>
>>>>>> Hi Jan, I think the issue with putting the lineage as part of the
>>>>>> representation is that that identifier only makes sense for that
>>>>>> representation's engine.  In your example, the catalog aliased as 
>>>>>> "iceberg"
>>>>>> in spark is going to have a different name in Dremio or Trino.
>>>>>>
>>>>>> IMO, if we are to store a lineage for a view, it should consist of
>>>>>> something engine agnostic like the table/view UUIDs.  This would be 
>>>>>> stored
>>>>>> at the view version level and not the representation level.  I think as 
>>>>>> we
>>>>>> get into more of these multi-engine, multi-catalog use cases for views, 
>>>>>> the
>>>>>> Iceberg Catalog is going to need to do a better job at handling CRUD by
>>>>>> UUID instead of engine specific identifiers.  Another scenario we need to
>>>>>> think through is a view that joins tables from two different catalogs.  
>>>>>> How
>>>>>> would we represent the lineage for that in an engine agnostic way?
>>>>>>
>>>>>> Thanks
>>>>>> Benny
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul
>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I have a
>>>>>>> better understanding now. Sorry for being a bit stubborn before.
>>>>>>>
>>>>>>> Wouldn't it make sense then to store the lineage as part of the
>>>>>>> representation:
>>>>>>>
>>>>>>> {
>>>>>>>
>>>>>>>     "type": "sql",
>>>>>>>
>>>>>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
>>>>>>> events\nGROUP BY 2",
>>>>>>>
>>>>>>>     "dialect": "spark",
>>>>>>>
>>>>>>>     "lineage": [{
>>>>>>>
>>>>>>>         "identifier": { "catalog": "iceberg", "namespace": "public",
>>>>>>> "table": "events"},
>>>>>>>
>>>>>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>>>>>
>>>>>>>     }]
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> Best wishes,
>>>>>>>
>>>>>>> Jan
>>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>>>>>
>>>>>>> Benny, thank you so much for performing the experiment. Glad that
>>>>>>> using UUIDs as keys in the state map makes more sense now.
>>>>>>>
>>>>>>> For the issue with the view spec being restrictive, I agree and I
>>>>>>> have raised the concern on the view spec PR last year [1]. I think 
>>>>>>> there is
>>>>>>> some area of improvement here. At the least, if it is restrictive, it
>>>>>>> should be explicitly stated. I will start a thread on how to approach 
>>>>>>> the
>>>>>>> view spec. We may need to get more insight on the view spec before
>>>>>>> finalizing the MV spec, because view spec will determine if we should
>>>>>>> proceed with one lineage (with the implicitly assumed common catalog 
>>>>>>> name),
>>>>>>> or with multiple lineages (one per engine or catalog name).
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Walaa
>>>>>>>>
>>>>>>>> I did some testing with two different engines (Spark and Dremio)
>>>>>>>> against the same Nessie catalog and created the attached materialized 
>>>>>>>> view
>>>>>>>> metadata.json.  I see your point now about the SQL identifiers being
>>>>>>>> tightly coupled to the engines.  In the metadata JSON, spark refers to 
>>>>>>>> the
>>>>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>>>>>>> "LocalNessie".  So, this means that the fully qualified view and table
>>>>>>>> identifiers are engine specific and Dremio can't lookup a Spark 
>>>>>>>> identifier
>>>>>>>> and vice versa.
>>>>>>>>
>>>>>>>> *So, I think it does make sense now for the refresh-state to key
>>>>>>>> off the UUIDs and not use engine specific identifiers.  *This also
>>>>>>>> means that the materization consumer will have to fully expand the 
>>>>>>>> query
>>>>>>>> tree and basically diff the UUID + latest snapshot ids against the 
>>>>>>>> refresh
>>>>>>>> state.  Would it ever make sense for the Iceberg Catalog to expose a 
>>>>>>>> bulk
>>>>>>>> lookup API by UUID?
>>>>>>>>
>>>>>>>> As a side note, it seems that for a materialized view to work with
>>>>>>>> multiple engines, the default-catalog and default-namespace can't be 
>>>>>>>> used
>>>>>>>> unless both engines use the same catalog name which seems pretty
>>>>>>>> restrictive to me.
>>>>>>>>
>>>>>>>> Thanks for the great discussions
>>>>>>>> Benny
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>>>>>> representations in Approach 1.
>>>>>>>>>
>>>>>>>>> The takeaway is that SQL identifiers are highly coupled with
>>>>>>>>> engines, just like views. It makes sense to track both together for
>>>>>>>>> consistency.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
>>>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Walaa, thanks you for bringing up this use case. I think we need
>>>>>>>>>> to keep in mind that we require identifiers to interface with the 
>>>>>>>>>> catalog.
>>>>>>>>>> We cannot use UUIDs.
>>>>>>>>>>
>>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for your
>>>>>>>>>> use case because you can't store the catalog names of multiple
>>>>>>>>>> representations in the lineage. You would need to fallback to 
>>>>>>>>>> parsing the
>>>>>>>>>> SQL for a particular representation and rebuilding the full query 
>>>>>>>>>> tree to
>>>>>>>>>> obtain the identifiers.
>>>>>>>>>>
>>>>>>>>>> You could do the same for Approach 2. So I don't see why Approach
>>>>>>>>>> 1 would yield any benefits.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>>>>>
>>>>>>>>>> Benny, `default-catalog` is optional, while `default-namespace`
>>>>>>>>>> is required.
>>>>>>>>>>
>>>>>>>>>> I will retract my comment on the `summary`. it indicates the
>>>>>>>>>> engine that made the revision to the current view version. it doesn't
>>>>>>>>>> really matter for multi-engine/representation support.
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>>>>>>>> wouldn't have to be under a specific representation.
>>>>>>>>>>> Walaa - That's a serious concern...  If the same catalog is
>>>>>>>>>>> aliased differently by two different engines, then the basic view 
>>>>>>>>>>> spec
>>>>>>>>>>> seems broken to me since "default-namespace" includes the catalog 
>>>>>>>>>>> alias and
>>>>>>>>>>> is outside of the SQL representation.  Does that mean for a view to 
>>>>>>>>>>> be
>>>>>>>>>>> interoperable, we require different engines to use the same catalog 
>>>>>>>>>>> name?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>>>>>>>> representations (for different engines), which definitely requires 
>>>>>>>>>>>> more
>>>>>>>>>>>> discussion from the community.
>>>>>>>>>>>>
>>>>>>>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>>>>>>>> "summary" field seems meant for only one engine, while 
>>>>>>>>>>>> "representations"
>>>>>>>>>>>> support multiple engines.
>>>>>>>>>>>>
>>>>>>>>>>>> "summary" : {
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>>>>>>> "engine-name" : "Spark",
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>>>>>>> "engineVersion" : "3.3.2"
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>>>>>>> "representations" : [ {
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20>
>>>>>>>>>>>> "type" : "sql",
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21>
>>>>>>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM 
>>>>>>>>>>>> events\nGROUP BY
>>>>>>>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>>>>>>> "dialect" : "spark"
>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>>>>>>>
>>>>>>>>>>>> With multiple representations/engines, I guess one engine will
>>>>>>>>>>>> be responsible for the storage table refresh and other engines are 
>>>>>>>>>>>> read
>>>>>>>>>>>> only. If we want to store the lineage info in the view, it probably
>>>>>>>>>>>> needs to be part of the "representation" struct so that each
>>>>>>>>>>>> engine/representation stores its own lineage info..
>>>>>>>>>>>> Who is to validate/ensure that the SQL representation is
>>>>>>>>>>>> actually semantically identical (minus syntax differences across 
>>>>>>>>>>>> engines)?
>>>>>>>>>>>> I guess this responsibility is left to the user who owns and 
>>>>>>>>>>>> manages the
>>>>>>>>>>>> view.
>>>>>>>>>>>>
>>>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to