Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Thu, 19 Sep 2024 10:03:01 -0700

Hi Jan

"PartialIdentifier" without the catalog name sounds good to me.  The
storage table and MV have to be in the same catalog.  That would be a good
fifth requirement to add to the list.


Thanks
Benny

On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <[email protected]>
wrote:

> Cool, I guess it's easier to resolve these kind of things when talking in
> person.
>
> I agree with your requirements and the conclusion to use a map from UUID
> to snapshot-id/version-id as the refresh-state, as well as dropping the
> lineage in favor to just re-parsing the SQL query. This gets us around the
> "catalog alias" issue.
>
> And I'm also OK with every engine requiring their own representation to
> use the MV.
>
> There is still the issue with the identifier of the storage table and its
> catalog_name. Should we use an "PartialIdentifier" with a namespace and a
> name field, like so:
>
> {
>
>     namespace: ["bronze"],
>
>     name: "lineitem"
>
> }
>
> And require the storage table to be in the same catalog as the MV itself?
>
> Thanks,
>
> Jan
> On 19.09.24 00:50, Benny Chow wrote:
>
> Steven and I met up yesterday at the Seattle Iceberg meetup and we got to
> talking about the "catalog alias" issue.  He described it as an annoying
> problem =p
>
> I think there are some key requirements we need to support:
>
> 1. Different engines can produce and consume shared MVs with freshness
> validation.
> 2. We cannot force different engines to standardize on the alias they use
> for the catalog.
> 3. We cannot force different SQL representations to exclude catalog names
> from table identifiers or not use fully qualified table names.
> 4. MV SQL can join tables and views from multiple catalogs -> Inevitable
> with Nessie, Polaris, Unity, Tabular and others...
>
> The producing engine has to save refresh state information to let
> consuming engine know that table X is at what snapshot at the time of
> materialization.  The only way to identify this table across different
> catalog names is to use the cross catalog, globally unique UUID.  I think
> our only option is to have the refresh state map UUID to snapshot ids and
> view version ids.
>
> Assuming the above is how we store the refresh state, how does the
> consuming engine determine the current snapshot ids?  The consuming engine
> will have to fully expand the query tree at which point it will have the
> UUIDs as well as the latest snapshot ids/view versions.  This can then be
> diffed against the materialization refresh state to determine freshness.
> There isn't a need to store the view lineage information to map from UUID
> to the consumer specific identifier so that the consumer can then call back
> into the catalog with that identifier to get the latest state.  The
> consuming engine might as well just re-parse the SQL and expand the query.
>
> Personally, I'm OK with requiring that an engine must have its own SQL
> representation in order to use the MV.  To me, being able to fulfill the
> key requirements above is much more important.
>
> Thanks
> Benny
>
> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <[email protected]>
> <[email protected]> wrote:
>
>> How about we make the *catalog_name field* of the identifier optional?
>> If the field is missing, it references a table/view in the same catalog. If
>> it is present it has to be an engine agnostic catalog name. Shouldn't the
>> catalog_names from the REST catalog spec be engine agnostic?
>>
>> I was wondering, is there no way to prescribe a catalog_name in Spark or
>> Dremio? What do you do if you include two Nessie catalogs? They can't both
>> be called LocalNessie.
>>
>> Thanks,
>>
>> Jan
>> On 14.09.24 01:23, Benny Chow wrote:
>>
>> The main reason for putting the lineage into the view is so that
>> "another" engine can enumerate out the tables in the view without needing
>> to parse any SQL.  But, if we put the lineage under the SQL representation
>> with engine specific catalog names, the "other" engine is not going to be
>> able to use those identifiers to look up the tables.  The "other" engine
>> can only lookup those identifiers using its engine specific catalog name.
>> It may be possible to enumerate the tables at the view version level ONLY
>> if those identifiers don't include the catalog name.  However, if you have
>> a view with a cross catalog join, then the tables coming from the other
>> catalog have to be fully qualified.  But then the problem is that each
>> engine will also alias the other catalog differently too.
>>
>> So, I think to summarize *multi-engine* view interoperability:
>>
>>    - default-catalog can't be specified
>>    - default-namespace can be specified
>>    - View SQL can only references tables/views from the same catalog
>>
>> I think these are reasonable constraints for multi-engine use cases.  If
>> reasonable, for MVs, then the storage table, refresh-state and lineage (at
>> the view version level), could all be based on *engine agnostic*
>> identifiers without the catalog name.  The MV and storage table would have
>> to be in the same catalog.
>>
>> Thanks
>> Benny
>>
>>
>>
>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <[email protected]>
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> regarding our recent discussion on table identifiers with respect to
>>> different catalog_names with different query engines. We have the same
>>> problem when we want to reference the storage table from the common view.
>>> *If we include the catalog_name as part of the identifier, different
>>> query engines might not be able to load the storage table. *
>>> We could enforce that every storage table has to be part of the same
>>> catalog as the main view. This way an identifier without the catalog_name
>>> would be enough to point to the correct storage table.
>>>
>>> What are your thoughts on this?
>>>
>>> Best wishes,
>>>
>>> Jan
>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>>>
>>> I think this type of discussion is exactly what motivates a
>>> clarification in the view spec so that we can resolve MV lineage. Will
>>> create separate thread for view spec clarification.
>>>
>>> Following up on Jan’s point, yes I agree in order to support catalog
>>> name, it should be at the representation level, but catalog name does not
>>> really depend on the “dialect” but rather on the “engine”; hence the
>>> discussion becomes a little more involved.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> Hi Benny,
>>>>
>>>> I think that identifiers only being defined for a certain
>>>> representation is exactly what we want. Each representation can define
>>>> their own identifiers that then map to an UUID. This way the "catalog_name"
>>>> of the identifier for a "Spark" dialect can be different then for a
>>>> "Dremio" dialect.
>>>>
>>>> The important part is that we still have a list of identifiers for each
>>>> representation that we can use with the catalog to obtain the state of the
>>>> source tables.
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>> On 11.09.24 01:33, Benny Chow wrote:
>>>>
>>>> Hi Walaa, I don't think the current view spec implicitly assumes a
>>>> common catalog name between engines.  I tested this by not specifying the
>>>> default-catalog and both engines could look up the correct table under the
>>>> shared default-namespace even when each engine uses a different catalog
>>>> name.
>>>>
>>>> Hi Jan, I think the issue with putting the lineage as part of the
>>>> representation is that that identifier only makes sense for that
>>>> representation's engine.  In your example, the catalog aliased as "iceberg"
>>>> in spark is going to have a different name in Dremio or Trino.
>>>>
>>>> IMO, if we are to store a lineage for a view, it should consist of
>>>> something engine agnostic like the table/view UUIDs.  This would be stored
>>>> at the view version level and not the representation level.  I think as we
>>>> get into more of these multi-engine, multi-catalog use cases for views, the
>>>> Iceberg Catalog is going to need to do a better job at handling CRUD by
>>>> UUID instead of engine specific identifiers.  Another scenario we need to
>>>> think through is a view that joins tables from two different catalogs.  How
>>>> would we represent the lineage for that in an engine agnostic way?
>>>>
>>>> Thanks
>>>> Benny
>>>>
>>>>
>>>>
>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <[email protected]>
>>>> <[email protected]> wrote:
>>>>
>>>>> Thanks Walaa and Benny for clarifying the problem. I think I have a
>>>>> better understanding now. Sorry for being a bit stubborn before.
>>>>>
>>>>> Wouldn't it make sense then to store the lineage as part of the
>>>>> representation:
>>>>>
>>>>> {
>>>>>
>>>>>     "type": "sql",
>>>>>
>>>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
>>>>> events\nGROUP BY 2",
>>>>>
>>>>>     "dialect": "spark",
>>>>>
>>>>>     "lineage": [{
>>>>>
>>>>>         "identifier": { "catalog": "iceberg", "namespace": "public",
>>>>> "table": "events"},
>>>>>
>>>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>>>
>>>>>     }]
>>>>>
>>>>> }
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Jan
>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>>>
>>>>> Benny, thank you so much for performing the experiment. Glad that
>>>>> using UUIDs as keys in the state map makes more sense now.
>>>>>
>>>>> For the issue with the view spec being restrictive, I agree and I have
>>>>> raised the concern on the view spec PR last year [1]. I think there is 
>>>>> some
>>>>> area of improvement here. At the least, if it is restrictive, it should be
>>>>> explicitly stated. I will start a thread on how to approach the view spec.
>>>>> We may need to get more insight on the view spec before finalizing the MV
>>>>> spec, because view spec will determine if we should proceed with one
>>>>> lineage (with the implicitly assumed common catalog name), or with 
>>>>> multiple
>>>>> lineages (one per engine or catalog name).
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]> wrote:
>>>>>
>>>>>> Hi Walaa
>>>>>>
>>>>>> I did some testing with two different engines (Spark and Dremio)
>>>>>> against the same Nessie catalog and created the attached materialized 
>>>>>> view
>>>>>> metadata.json.  I see your point now about the SQL identifiers being
>>>>>> tightly coupled to the engines.  In the metadata JSON, spark refers to 
>>>>>> the
>>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>>>>> "LocalNessie".  So, this means that the fully qualified view and table
>>>>>> identifiers are engine specific and Dremio can't lookup a Spark 
>>>>>> identifier
>>>>>> and vice versa.
>>>>>>
>>>>>> *So, I think it does make sense now for the refresh-state to key off
>>>>>> the UUIDs and not use engine specific identifiers.  *This also means
>>>>>> that the materization consumer will have to fully expand the query tree 
>>>>>> and
>>>>>> basically diff the UUID + latest snapshot ids against the refresh state.
>>>>>> Would it ever make sense for the Iceberg Catalog to expose a bulk lookup
>>>>>> API by UUID?
>>>>>>
>>>>>> As a side note, it seems that for a materialized view to work with
>>>>>> multiple engines, the default-catalog and default-namespace can't be used
>>>>>> unless both engines use the same catalog name which seems pretty
>>>>>> restrictive to me.
>>>>>>
>>>>>> Thanks for the great discussions
>>>>>> Benny
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>>>> representations in Approach 1.
>>>>>>>
>>>>>>> The takeaway is that SQL identifiers are highly coupled with
>>>>>>> engines, just like views. It makes sense to track both together for
>>>>>>> consistency.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Walaa, thanks you for bringing up this use case. I think we need to
>>>>>>>> keep in mind that we require identifiers to interface with the 
>>>>>>>> catalog. We
>>>>>>>> cannot use UUIDs.
>>>>>>>>
>>>>>>>> Which means you also wouldn't be able to use Approach 1 for your
>>>>>>>> use case because you can't store the catalog names of multiple
>>>>>>>> representations in the lineage. You would need to fallback to parsing 
>>>>>>>> the
>>>>>>>> SQL for a particular representation and rebuilding the full query tree 
>>>>>>>> to
>>>>>>>> obtain the identifiers.
>>>>>>>>
>>>>>>>> You could do the same for Approach 2. So I don't see why Approach 1
>>>>>>>> would yield any benefits.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jan
>>>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>>>
>>>>>>>> Benny, `default-catalog` is optional, while `default-namespace` is
>>>>>>>> required.
>>>>>>>>
>>>>>>>> I will retract my comment on the `summary`. it indicates the engine
>>>>>>>> that made the revision to the current view version. it doesn't really
>>>>>>>> matter for multi-engine/representation support.
>>>>>>>>
>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>>>>>> wouldn't have to be under a specific representation.
>>>>>>>>> Walaa - That's a serious concern...  If the same catalog is
>>>>>>>>> aliased differently by two different engines, then the basic view spec
>>>>>>>>> seems broken to me since "default-namespace" includes the catalog 
>>>>>>>>> alias and
>>>>>>>>> is outside of the SQL representation.  Does that mean for a view to be
>>>>>>>>> interoperable, we require different engines to use the same catalog 
>>>>>>>>> name?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>>>>>> representations (for different engines), which definitely requires 
>>>>>>>>>> more
>>>>>>>>>> discussion from the community.
>>>>>>>>>>
>>>>>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>>>>>> "summary" field seems meant for only one engine, while 
>>>>>>>>>> "representations"
>>>>>>>>>> support multiple engines.
>>>>>>>>>>
>>>>>>>>>> "summary" : {
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>>>>> "engine-name" : "Spark",
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>>>>> "engineVersion" : "3.3.2"
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>>>>> "representations" : [ {
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type"
>>>>>>>>>> : "sql",
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21> "sql"
>>>>>>>>>> : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 
>>>>>>>>>> 2",
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>>>>> "dialect" : "spark"
>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>>>>>
>>>>>>>>>> With multiple representations/engines, I guess one engine will be
>>>>>>>>>> responsible for the storage table refresh and other engines are read 
>>>>>>>>>> only.
>>>>>>>>>> If we want to store the lineage info in the view, it probably needs 
>>>>>>>>>> to be
>>>>>>>>>> part of the "representation" struct so that each 
>>>>>>>>>> engine/representation
>>>>>>>>>> stores its own lineage info..
>>>>>>>>>> Who is to validate/ensure that the SQL representation is actually
>>>>>>>>>> semantically identical (minus syntax differences across engines)? I 
>>>>>>>>>> guess
>>>>>>>>>> this responsibility is left to the user who owns and manages the 
>>>>>>>>>> view.
>>>>>>>>>>
>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to