Re: [DISCUSS] Iceberg Materialzied Views

Walaa Eldin Moustafa Thu, 19 Sep 2024 13:55:25 -0700

I think the solution for the storage identifier might be shared with the
end state solution for the lineage. One could imagine a "full identifier"
can be used for the storage table; however, it is
"representation"-dependent (i.e., it changes according to
which representation it is part of, or rather which engine uses it).


Also, are we asking engines (or their Iceberg implementation) to throw an
exception if the full storage table identifier was provided as part of the
MV definition? Sounds like a not very ideal UX. Note that it also conflicts
with the spirit of requirement #3.

Thanks,
Walaa.

On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <[email protected]> wrote:

> Hi Jan
>
> "PartialIdentifier" without the catalog name sounds good to me.  The
> storage table and MV have to be in the same catalog.  That would be a good
> fifth requirement to add to the list.
>
> Thanks
> Benny
>
> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <[email protected]>
> wrote:
>
>> Cool, I guess it's easier to resolve these kind of things when talking in
>> person.
>>
>> I agree with your requirements and the conclusion to use a map from UUID
>> to snapshot-id/version-id as the refresh-state, as well as dropping the
>> lineage in favor to just re-parsing the SQL query. This gets us around the
>> "catalog alias" issue.
>>
>> And I'm also OK with every engine requiring their own representation to
>> use the MV.
>>
>> There is still the issue with the identifier of the storage table and its
>> catalog_name. Should we use an "PartialIdentifier" with a namespace and a
>> name field, like so:
>>
>> {
>>
>>     namespace: ["bronze"],
>>
>>     name: "lineitem"
>>
>> }
>>
>> And require the storage table to be in the same catalog as the MV itself?
>>
>> Thanks,
>>
>> Jan
>> On 19.09.24 00:50, Benny Chow wrote:
>>
>> Steven and I met up yesterday at the Seattle Iceberg meetup and we got to
>> talking about the "catalog alias" issue.  He described it as an annoying
>> problem =p
>>
>> I think there are some key requirements we need to support:
>>
>> 1. Different engines can produce and consume shared MVs with freshness
>> validation.
>> 2. We cannot force different engines to standardize on the alias they use
>> for the catalog.
>> 3. We cannot force different SQL representations to exclude catalog names
>> from table identifiers or not use fully qualified table names.
>> 4. MV SQL can join tables and views from multiple catalogs -> Inevitable
>> with Nessie, Polaris, Unity, Tabular and others...
>>
>> The producing engine has to save refresh state information to let
>> consuming engine know that table X is at what snapshot at the time of
>> materialization.  The only way to identify this table across different
>> catalog names is to use the cross catalog, globally unique UUID.  I think
>> our only option is to have the refresh state map UUID to snapshot ids and
>> view version ids.
>>
>> Assuming the above is how we store the refresh state, how does the
>> consuming engine determine the current snapshot ids?  The consuming engine
>> will have to fully expand the query tree at which point it will have the
>> UUIDs as well as the latest snapshot ids/view versions.  This can then be
>> diffed against the materialization refresh state to determine freshness.
>> There isn't a need to store the view lineage information to map from UUID
>> to the consumer specific identifier so that the consumer can then call back
>> into the catalog with that identifier to get the latest state.  The
>> consuming engine might as well just re-parse the SQL and expand the query.
>>
>> Personally, I'm OK with requiring that an engine must have its own SQL
>> representation in order to use the MV.  To me, being able to fulfill the
>> key requirements above is much more important.
>>
>> Thanks
>> Benny
>>
>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <[email protected]>
>> <[email protected]> wrote:
>>
>>> How about we make the *catalog_name field* of the identifier optional?
>>> If the field is missing, it references a table/view in the same catalog. If
>>> it is present it has to be an engine agnostic catalog name. Shouldn't the
>>> catalog_names from the REST catalog spec be engine agnostic?
>>>
>>> I was wondering, is there no way to prescribe a catalog_name in Spark or
>>> Dremio? What do you do if you include two Nessie catalogs? They can't both
>>> be called LocalNessie.
>>>
>>> Thanks,
>>>
>>> Jan
>>> On 14.09.24 01:23, Benny Chow wrote:
>>>
>>> The main reason for putting the lineage into the view is so that
>>> "another" engine can enumerate out the tables in the view without needing
>>> to parse any SQL.  But, if we put the lineage under the SQL representation
>>> with engine specific catalog names, the "other" engine is not going to be
>>> able to use those identifiers to look up the tables.  The "other" engine
>>> can only lookup those identifiers using its engine specific catalog name.
>>> It may be possible to enumerate the tables at the view version level ONLY
>>> if those identifiers don't include the catalog name.  However, if you have
>>> a view with a cross catalog join, then the tables coming from the other
>>> catalog have to be fully qualified.  But then the problem is that each
>>> engine will also alias the other catalog differently too.
>>>
>>> So, I think to summarize *multi-engine* view interoperability:
>>>
>>>    - default-catalog can't be specified
>>>    - default-namespace can be specified
>>>    - View SQL can only references tables/views from the same catalog
>>>
>>> I think these are reasonable constraints for multi-engine use cases.  If
>>> reasonable, for MVs, then the storage table, refresh-state and lineage (at
>>> the view version level), could all be based on *engine agnostic*
>>> identifiers without the catalog name.  The MV and storage table would have
>>> to be in the same catalog.
>>>
>>> Thanks
>>> Benny
>>>
>>>
>>>
>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> regarding our recent discussion on table identifiers with respect to
>>>> different catalog_names with different query engines. We have the same
>>>> problem when we want to reference the storage table from the common view.
>>>> *If we include the catalog_name as part of the identifier, different
>>>> query engines might not be able to load the storage table. *
>>>> We could enforce that every storage table has to be part of the same
>>>> catalog as the main view. This way an identifier without the catalog_name
>>>> would be enough to point to the correct storage table.
>>>>
>>>> What are your thoughts on this?
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>>>>
>>>> I think this type of discussion is exactly what motivates a
>>>> clarification in the view spec so that we can resolve MV lineage. Will
>>>> create separate thread for view spec clarification.
>>>>
>>>> Following up on Jan’s point, yes I agree in order to support catalog
>>>> name, it should be at the representation level, but catalog name does not
>>>> really depend on the “dialect” but rather on the “engine”; hence the
>>>> discussion becomes a little more involved.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <[email protected]>
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Benny,
>>>>>
>>>>> I think that identifiers only being defined for a certain
>>>>> representation is exactly what we want. Each representation can define
>>>>> their own identifiers that then map to an UUID. This way the 
>>>>> "catalog_name"
>>>>> of the identifier for a "Spark" dialect can be different then for a
>>>>> "Dremio" dialect.
>>>>>
>>>>> The important part is that we still have a list of identifiers for
>>>>> each representation that we can use with the catalog to obtain the state 
>>>>> of
>>>>> the source tables.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Jan
>>>>> On 11.09.24 01:33, Benny Chow wrote:
>>>>>
>>>>> Hi Walaa, I don't think the current view spec implicitly assumes a
>>>>> common catalog name between engines.  I tested this by not specifying the
>>>>> default-catalog and both engines could look up the correct table under the
>>>>> shared default-namespace even when each engine uses a different catalog
>>>>> name.
>>>>>
>>>>> Hi Jan, I think the issue with putting the lineage as part of the
>>>>> representation is that that identifier only makes sense for that
>>>>> representation's engine.  In your example, the catalog aliased as 
>>>>> "iceberg"
>>>>> in spark is going to have a different name in Dremio or Trino.
>>>>>
>>>>> IMO, if we are to store a lineage for a view, it should consist of
>>>>> something engine agnostic like the table/view UUIDs.  This would be stored
>>>>> at the view version level and not the representation level.  I think as we
>>>>> get into more of these multi-engine, multi-catalog use cases for views, 
>>>>> the
>>>>> Iceberg Catalog is going to need to do a better job at handling CRUD by
>>>>> UUID instead of engine specific identifiers.  Another scenario we need to
>>>>> think through is a view that joins tables from two different catalogs.  
>>>>> How
>>>>> would we represent the lineage for that in an engine agnostic way?
>>>>>
>>>>> Thanks
>>>>> Benny
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <[email protected]>
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I have a
>>>>>> better understanding now. Sorry for being a bit stubborn before.
>>>>>>
>>>>>> Wouldn't it make sense then to store the lineage as part of the
>>>>>> representation:
>>>>>>
>>>>>> {
>>>>>>
>>>>>>     "type": "sql",
>>>>>>
>>>>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
>>>>>> events\nGROUP BY 2",
>>>>>>
>>>>>>     "dialect": "spark",
>>>>>>
>>>>>>     "lineage": [{
>>>>>>
>>>>>>         "identifier": { "catalog": "iceberg", "namespace": "public",
>>>>>> "table": "events"},
>>>>>>
>>>>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>>>>
>>>>>>     }]
>>>>>>
>>>>>> }
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Jan
>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>>>>
>>>>>> Benny, thank you so much for performing the experiment. Glad that
>>>>>> using UUIDs as keys in the state map makes more sense now.
>>>>>>
>>>>>> For the issue with the view spec being restrictive, I agree and I
>>>>>> have raised the concern on the view spec PR last year [1]. I think there 
>>>>>> is
>>>>>> some area of improvement here. At the least, if it is restrictive, it
>>>>>> should be explicitly stated. I will start a thread on how to approach the
>>>>>> view spec. We may need to get more insight on the view spec before
>>>>>> finalizing the MV spec, because view spec will determine if we should
>>>>>> proceed with one lineage (with the implicitly assumed common catalog 
>>>>>> name),
>>>>>> or with multiple lineages (one per engine or catalog name).
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Walaa
>>>>>>>
>>>>>>> I did some testing with two different engines (Spark and Dremio)
>>>>>>> against the same Nessie catalog and created the attached materialized 
>>>>>>> view
>>>>>>> metadata.json.  I see your point now about the SQL identifiers being
>>>>>>> tightly coupled to the engines.  In the metadata JSON, spark refers to 
>>>>>>> the
>>>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>>>>>> "LocalNessie".  So, this means that the fully qualified view and table
>>>>>>> identifiers are engine specific and Dremio can't lookup a Spark 
>>>>>>> identifier
>>>>>>> and vice versa.
>>>>>>>
>>>>>>> *So, I think it does make sense now for the refresh-state to key off
>>>>>>> the UUIDs and not use engine specific identifiers.  *This also
>>>>>>> means that the materization consumer will have to fully expand the query
>>>>>>> tree and basically diff the UUID + latest snapshot ids against the 
>>>>>>> refresh
>>>>>>> state.  Would it ever make sense for the Iceberg Catalog to expose a 
>>>>>>> bulk
>>>>>>> lookup API by UUID?
>>>>>>>
>>>>>>> As a side note, it seems that for a materialized view to work with
>>>>>>> multiple engines, the default-catalog and default-namespace can't be 
>>>>>>> used
>>>>>>> unless both engines use the same catalog name which seems pretty
>>>>>>> restrictive to me.
>>>>>>>
>>>>>>> Thanks for the great discussions
>>>>>>> Benny
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>>>>> representations in Approach 1.
>>>>>>>>
>>>>>>>> The takeaway is that SQL identifiers are highly coupled with
>>>>>>>> engines, just like views. It makes sense to track both together for
>>>>>>>> consistency.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul
>>>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Walaa, thanks you for bringing up this use case. I think we need
>>>>>>>>> to keep in mind that we require identifiers to interface with the 
>>>>>>>>> catalog.
>>>>>>>>> We cannot use UUIDs.
>>>>>>>>>
>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for your
>>>>>>>>> use case because you can't store the catalog names of multiple
>>>>>>>>> representations in the lineage. You would need to fallback to parsing 
>>>>>>>>> the
>>>>>>>>> SQL for a particular representation and rebuilding the full query 
>>>>>>>>> tree to
>>>>>>>>> obtain the identifiers.
>>>>>>>>>
>>>>>>>>> You could do the same for Approach 2. So I don't see why Approach
>>>>>>>>> 1 would yield any benefits.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jan
>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>>>>
>>>>>>>>> Benny, `default-catalog` is optional, while `default-namespace` is
>>>>>>>>> required.
>>>>>>>>>
>>>>>>>>> I will retract my comment on the `summary`. it indicates the
>>>>>>>>> engine that made the revision to the current view version. it doesn't
>>>>>>>>> really matter for multi-engine/representation support.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>>>>>>> wouldn't have to be under a specific representation.
>>>>>>>>>> Walaa - That's a serious concern...  If the same catalog is
>>>>>>>>>> aliased differently by two different engines, then the basic view 
>>>>>>>>>> spec
>>>>>>>>>> seems broken to me since "default-namespace" includes the catalog 
>>>>>>>>>> alias and
>>>>>>>>>> is outside of the SQL representation.  Does that mean for a view to 
>>>>>>>>>> be
>>>>>>>>>> interoperable, we require different engines to use the same catalog 
>>>>>>>>>> name?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>>>>>>> representations (for different engines), which definitely requires 
>>>>>>>>>>> more
>>>>>>>>>>> discussion from the community.
>>>>>>>>>>>
>>>>>>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>>>>>>> "summary" field seems meant for only one engine, while 
>>>>>>>>>>> "representations"
>>>>>>>>>>> support multiple engines.
>>>>>>>>>>>
>>>>>>>>>>> "summary" : {
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>>>>>> "engine-name" : "Spark",
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>>>>>> "engineVersion" : "3.3.2"
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>>>>>> "representations" : [ {
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20>
>>>>>>>>>>> "type" : "sql",
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21> "sql"
>>>>>>>>>>> : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 
>>>>>>>>>>> 2",
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>>>>>> "dialect" : "spark"
>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>>>>>>
>>>>>>>>>>> With multiple representations/engines, I guess one engine will
>>>>>>>>>>> be responsible for the storage table refresh and other engines are 
>>>>>>>>>>> read
>>>>>>>>>>> only. If we want to store the lineage info in the view, it probably
>>>>>>>>>>> needs to be part of the "representation" struct so that each
>>>>>>>>>>> engine/representation stores its own lineage info..
>>>>>>>>>>> Who is to validate/ensure that the SQL representation is
>>>>>>>>>>> actually semantically identical (minus syntax differences across 
>>>>>>>>>>> engines)?
>>>>>>>>>>> I guess this responsibility is left to the user who owns and 
>>>>>>>>>>> manages the
>>>>>>>>>>> view.
>>>>>>>>>>>
>>>>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to