Re: [DISCUSS] Iceberg Materialzied Views

Walaa Eldin Moustafa Wed, 11 Sep 2024 07:07:28 -0700

I think this type of discussion is exactly what motivates a clarification
in the view spec so that we can resolve MV lineage. Will create separate
thread for view spec clarification.


Following up on Jan’s point, yes I agree in order to support catalog name,
it should be at the representation level, but catalog name does not really
depend on the “dialect” but rather on the “engine”; hence the discussion
becomes a little more involved.

Thanks,
Walaa.

On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> Hi Benny,
>
> I think that identifiers only being defined for a certain representation
> is exactly what we want. Each representation can define their own
> identifiers that then map to an UUID. This way the "catalog_name" of the
> identifier for a "Spark" dialect can be different then for a "Dremio"
> dialect.
>
> The important part is that we still have a list of identifiers for each
> representation that we can use with the catalog to obtain the state of the
> source tables.
>
> Best wishes,
>
> Jan
> On 11.09.24 01:33, Benny Chow wrote:
>
> Hi Walaa, I don't think the current view spec implicitly assumes a common
> catalog name between engines.  I tested this by not specifying the
> default-catalog and both engines could look up the correct table under the
> shared default-namespace even when each engine uses a different catalog
> name.
>
> Hi Jan, I think the issue with putting the lineage as part of the
> representation is that that identifier only makes sense for that
> representation's engine.  In your example, the catalog aliased as "iceberg"
> in spark is going to have a different name in Dremio or Trino.
>
> IMO, if we are to store a lineage for a view, it should consist of
> something engine agnostic like the table/view UUIDs.  This would be stored
> at the view version level and not the representation level.  I think as we
> get into more of these multi-engine, multi-catalog use cases for views, the
> Iceberg Catalog is going to need to do a better job at handling CRUD by
> UUID instead of engine specific identifiers.  Another scenario we need to
> think through is a view that joins tables from two different catalogs.  How
> would we represent the lineage for that in an engine agnostic way?
>
> Thanks
> Benny
>
>
>
> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <jank...@mailbox.org.invalid>
> <jank...@mailbox.org.invalid> wrote:
>
>> Thanks Walaa and Benny for clarifying the problem. I think I have a
>> better understanding now. Sorry for being a bit stubborn before.
>>
>> Wouldn't it make sense then to store the lineage as part of the
>> representation:
>>
>> {
>>
>>     "type": "sql",
>>
>>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP
>> BY 2",
>>
>>     "dialect": "spark",
>>
>>     "lineage": [{
>>
>>         "identifier": { "catalog": "iceberg", "namespace": "public",
>> "table": "events"},
>>
>>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>>
>>     }]
>>
>> }
>>
>> Best wishes,
>>
>> Jan
>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>>
>> Benny, thank you so much for performing the experiment. Glad that using
>> UUIDs as keys in the state map makes more sense now.
>>
>> For the issue with the view spec being restrictive, I agree and I have
>> raised the concern on the view spec PR last year [1]. I think there is some
>> area of improvement here. At the least, if it is restrictive, it should be
>> explicitly stated. I will start a thread on how to approach the view spec.
>> We may need to get more insight on the view spec before finalizing the MV
>> spec, because view spec will determine if we should proceed with one
>> lineage (with the implicitly assumed common catalog name), or with multiple
>> lineages (one per engine or catalog name).
>>
>> [1] https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote:
>>
>>> Hi Walaa
>>>
>>> I did some testing with two different engines (Spark and Dremio) against
>>> the same Nessie catalog and created the attached materialized view
>>> metadata.json.  I see your point now about the SQL identifiers being
>>> tightly coupled to the engines.  In the metadata JSON, spark refers to the
>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>>> "LocalNessie".  So, this means that the fully qualified view and table
>>> identifiers are engine specific and Dremio can't lookup a Spark identifier
>>> and vice versa.
>>>
>>> *So, I think it does make sense now for the refresh-state to key off the
>>> UUIDs and not use engine specific identifiers.  *This also means that
>>> the materization consumer will have to fully expand the query tree and
>>> basically diff the UUID + latest snapshot ids against the refresh state.
>>> Would it ever make sense for the Iceberg Catalog to expose a bulk lookup
>>> API by UUID?
>>>
>>> As a side note, it seems that for a materialized view to work with
>>> multiple engines, the default-catalog and default-namespace can't be used
>>> unless both engines use the same catalog name which seems pretty
>>> restrictive to me.
>>>
>>> Thanks for the great discussions
>>> Benny
>>>
>>>
>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> Jan, we definitely can store SQL identifiers of multiple
>>>> representations in Approach 1.
>>>>
>>>> The takeaway is that SQL identifiers are highly coupled with engines,
>>>> just like views. It makes sense to track both together for consistency.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul <jank...@mailbox.org.invalid>
>>>> <jank...@mailbox.org.invalid> wrote:
>>>>
>>>>> Walaa, thanks you for bringing up this use case. I think we need to
>>>>> keep in mind that we require identifiers to interface with the catalog. We
>>>>> cannot use UUIDs.
>>>>>
>>>>> Which means you also wouldn't be able to use Approach 1 for your use
>>>>> case because you can't store the catalog names of multiple representations
>>>>> in the lineage. You would need to fallback to parsing the SQL for a
>>>>> particular representation and rebuilding the full query tree to obtain the
>>>>> identifiers.
>>>>>
>>>>> You could do the same for Approach 2. So I don't see why Approach 1
>>>>> would yield any benefits.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jan
>>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>>
>>>>> Benny, `default-catalog` is optional, while `default-namespace` is
>>>>> required.
>>>>>
>>>>> I will retract my comment on the `summary`. it indicates the engine
>>>>> that made the revision to the current view version. it doesn't really
>>>>> matter for multi-engine/representation support.
>>>>>
>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <btc...@gmail.com> wrote:
>>>>>
>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>>> wouldn't have to be under a specific representation.
>>>>>> Walaa - That's a serious concern...  If the same catalog is aliased
>>>>>> differently by two different engines, then the basic view spec seems 
>>>>>> broken
>>>>>> to me since "default-namespace" includes the catalog alias and is outside
>>>>>> of the SQL representation.  Does that mean for a view to be 
>>>>>> interoperable,
>>>>>> we require different engines to use the same catalog name?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <stevenz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>>> representations (for different engines), which definitely requires more
>>>>>>> discussion from the community.
>>>>>>>
>>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>>> "summary" field seems meant for only one engine, while "representations"
>>>>>>> support multiple engines.
>>>>>>>
>>>>>>> "summary" : {
>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>>> "engine-name" : "Spark",
>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>>> "engineVersion" : "3.3.2"
>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>>> "representations" : [ {
>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type" :
>>>>>>> "sql", <https://iceberg.apache.org/view-spec/#__codelineno-5-21>
>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP 
>>>>>>> BY
>>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>>> "dialect" : "spark"
>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>>
>>>>>>> With multiple representations/engines, I guess one engine will be
>>>>>>> responsible for the storage table refresh and other engines are read 
>>>>>>> only.
>>>>>>> If we want to store the lineage info in the view, it probably needs to 
>>>>>>> be
>>>>>>> part of the "representation" struct so that each engine/representation
>>>>>>> stores its own lineage info..
>>>>>>> Who is to validate/ensure that the SQL representation is actually
>>>>>>> semantically identical (minus syntax differences across engines)? I 
>>>>>>> guess
>>>>>>> this responsibility is left to the user who owns and manages the view.
>>>>>>>
>>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to