Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Fri, 13 Sep 2024 16:23:47 -0700

The main reason for putting the lineage into the view is so that "another"
engine can enumerate out the tables in the view without needing to parse
any SQL.  But, if we put the lineage under the SQL representation with
engine specific catalog names, the "other" engine is not going to be able
to use those identifiers to look up the tables.  The "other" engine can
only lookup those identifiers using its engine specific catalog name.  It
may be possible to enumerate the tables at the view version level ONLY if
those identifiers don't include the catalog name.  However, if you have a
view with a cross catalog join, then the tables coming from the other
catalog have to be fully qualified.  But then the problem is that each
engine will also alias the other catalog differently too.


So, I think to summarize *multi-engine* view interoperability:

   - default-catalog can't be specified
   - default-namespace can be specified
   - View SQL can only references tables/views from the same catalog

I think these are reasonable constraints for multi-engine use cases.  If
reasonable, for MVs, then the storage table, refresh-state and lineage (at
the view version level), could all be based on *engine agnostic*
identifiers without the catalog name.  The MV and storage table would have
to be in the same catalog.

Thanks
Benny



On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <[email protected]>
wrote:

> Hi,
>
> regarding our recent discussion on table identifiers with respect to
> different catalog_names with different query engines. We have the same
> problem when we want to reference the storage table from the common view.
> *If we include the catalog_name as part of the identifier, different query
> engines might not be able to load the storage table. *
> We could enforce that every storage table has to be part of the same
> catalog as the main view. This way an identifier without the catalog_name
> would be enough to point to the correct storage table.
>
> What are your thoughts on this?
>
> Best wishes,
>
> Jan
> On 11.09.24 16:05, Walaa Eldin Moustafa wrote:
>
> I think this type of discussion is exactly what motivates a clarification
> in the view spec so that we can resolve MV lineage. Will create separate
> thread for view spec clarification.
>
> Following up on Jan’s point, yes I agree in order to support catalog name,
> it should be at the representation level, but catalog name does not really
> depend on the “dialect” but rather on the “engine”; hence the discussion
> becomes a little more involved.
>
> Thanks,
> Walaa.
>
> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <[email protected]>
> <[email protected]> wrote:
>
>> Hi Benny,
>>
>> I think that identifiers only being defined for a certain representation
>> is exactly what we want. Each representation can define their own
>> identifiers that then map to an UUID. This way the "catalog_name" of the
>> identifier for a "Spark" dialect can be different then for a "Dremio"
>> dialect.
>>
>> The important part is that we still have a list of identifiers for each
>> representation that we can use with the catalog to obtain the state of the
>> source tables.
>>
>> Best wishes,
>>
>> Jan
>> On 11.09.24 01:33, Benny Chow wrote:
>>
>> Hi Walaa, I don't think the current view spec implicitly assumes a common
>> catalog name between engines.  I tested this by not specifying the
>> default-catalog and both engines could look up the correct table under the
>> shared default-namespace even when each engine uses a different catalog
>> name.
>>
>> Hi Jan, I think the issue with putting the lineage as part of the
>> representation is that that identifier only makes sense for that
>> representation's engine.  In your example, the catalog aliased as "iceberg"
>> in spark is going to have a different name in Dremio or Trino.
>>
>> IMO, if we are to store a lineage for a view, it should consist of
>> something engine agnostic like the table/view UUIDs.  This would be stored
>> at the view version level and not the representation level.  I think as we
>> get into more of these multi-engine, multi-catalog use cases for views, the
>> Iceberg Catalog is going to need to do a better job at handling CRUD by
>> UUID instead of engine specific identifiers.  Another scenario we need to
>> think through is a view that joins tables from two different catalogs.  How
>> would we represent the lineage for that in an engine agnostic way?
>>
>> Thanks
>> Benny
>>
>>
>>
>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <[email protected]>
>> <[email protected]> wrote:
>>
>>> Thanks Walaa and Benny for clarifying the problem. I think I have a
>>> better understanding now. Sorry for being a bit stubborn before.
>>>
>>> Wouldn't it make sense then to store the lineage as part of the
>>> representation:
>>>
>>> {
>>>
>>>     "type": "sql",
>>>
>>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM
>>> events\nGROUP BY 2",
>>>
>>>     "dialect": "spark",
>>>
>>>     "lineage": [{
>>>
>>>         "identifier": { "catalog": "iceberg", "namespace": "public",
>>> "table": "events"},
>>>
>>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>>
>>>     }]
>>>
>>> }
>>>
>>> Best wishes,
>>>
>>> Jan
>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>>
>>> Benny, thank you so much for performing the experiment. Glad that using
>>> UUIDs as keys in the state map makes more sense now.
>>>
>>> For the issue with the view spec being restrictive, I agree and I have
>>> raised the concern on the view spec PR last year [1]. I think there is some
>>> area of improvement here. At the least, if it is restrictive, it should be
>>> explicitly stated. I will start a thread on how to approach the view spec.
>>> We may need to get more insight on the view spec before finalizing the MV
>>> spec, because view spec will determine if we should proceed with one
>>> lineage (with the implicitly assumed common catalog name), or with multiple
>>> lineages (one per engine or catalog name).
>>>
>>> [1] https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <[email protected]> wrote:
>>>
>>>> Hi Walaa
>>>>
>>>> I did some testing with two different engines (Spark and Dremio)
>>>> against the same Nessie catalog and created the attached materialized view
>>>> metadata.json.  I see your point now about the SQL identifiers being
>>>> tightly coupled to the engines.  In the metadata JSON, spark refers to the
>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>>> "LocalNessie".  So, this means that the fully qualified view and table
>>>> identifiers are engine specific and Dremio can't lookup a Spark identifier
>>>> and vice versa.
>>>>
>>>> *So, I think it does make sense now for the refresh-state to key off
>>>> the UUIDs and not use engine specific identifiers.  *This also means
>>>> that the materization consumer will have to fully expand the query tree and
>>>> basically diff the UUID + latest snapshot ids against the refresh state.
>>>> Would it ever make sense for the Iceberg Catalog to expose a bulk lookup
>>>> API by UUID?
>>>>
>>>> As a side note, it seems that for a materialized view to work with
>>>> multiple engines, the default-catalog and default-namespace can't be used
>>>> unless both engines use the same catalog name which seems pretty
>>>> restrictive to me.
>>>>
>>>> Thanks for the great discussions
>>>> Benny
>>>>
>>>>
>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>>> [email protected]> wrote:
>>>>
>>>>> Jan, we definitely can store SQL identifiers of multiple
>>>>> representations in Approach 1.
>>>>>
>>>>> The takeaway is that SQL identifiers are highly coupled with engines,
>>>>> just like views. It makes sense to track both together for consistency.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul <[email protected]>
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Walaa, thanks you for bringing up this use case. I think we need to
>>>>>> keep in mind that we require identifiers to interface with the catalog. 
>>>>>> We
>>>>>> cannot use UUIDs.
>>>>>>
>>>>>> Which means you also wouldn't be able to use Approach 1 for your use
>>>>>> case because you can't store the catalog names of multiple 
>>>>>> representations
>>>>>> in the lineage. You would need to fallback to parsing the SQL for a
>>>>>> particular representation and rebuilding the full query tree to obtain 
>>>>>> the
>>>>>> identifiers.
>>>>>>
>>>>>> You could do the same for Approach 2. So I don't see why Approach 1
>>>>>> would yield any benefits.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jan
>>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>>
>>>>>> Benny, `default-catalog` is optional, while `default-namespace` is
>>>>>> required.
>>>>>>
>>>>>> I will retract my comment on the `summary`. it indicates the engine
>>>>>> that made the revision to the current view version. it doesn't really
>>>>>> matter for multi-engine/representation support.
>>>>>>
>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]> wrote:
>>>>>>
>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>>>> wouldn't have to be under a specific representation.
>>>>>>> Walaa - That's a serious concern...  If the same catalog is aliased
>>>>>>> differently by two different engines, then the basic view spec seems 
>>>>>>> broken
>>>>>>> to me since "default-namespace" includes the catalog alias and is 
>>>>>>> outside
>>>>>>> of the SQL representation.  Does that mean for a view to be 
>>>>>>> interoperable,
>>>>>>> we require different engines to use the same catalog name?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>>>> representations (for different engines), which definitely requires more
>>>>>>>> discussion from the community.
>>>>>>>>
>>>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>>>> "summary" field seems meant for only one engine, while 
>>>>>>>> "representations"
>>>>>>>> support multiple engines.
>>>>>>>>
>>>>>>>> "summary" : {
>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>>> "engine-name" : "Spark",
>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>>> "engineVersion" : "3.3.2"
>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>>> "representations" : [ {
>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type" :
>>>>>>>> "sql", <https://iceberg.apache.org/view-spec/#__codelineno-5-21>
>>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP 
>>>>>>>> BY
>>>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>>> "dialect" : "spark"
>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>>>
>>>>>>>> With multiple representations/engines, I guess one engine will be
>>>>>>>> responsible for the storage table refresh and other engines are read 
>>>>>>>> only.
>>>>>>>> If we want to store the lineage info in the view, it probably needs to 
>>>>>>>> be
>>>>>>>> part of the "representation" struct so that each engine/representation
>>>>>>>> stores its own lineage info..
>>>>>>>> Who is to validate/ensure that the SQL representation is actually
>>>>>>>> semantically identical (minus syntax differences across engines)? I 
>>>>>>>> guess
>>>>>>>> this responsibility is left to the user who owns and manages the view.
>>>>>>>>
>>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to