Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Tue, 10 Sep 2024 16:33:59 -0700

Hi Walaa, I don't think the current view spec implicitly assumes a common
catalog name between engines.  I tested this by not specifying the
default-catalog and both engines could look up the correct table under the
shared default-namespace even when each engine uses a different catalog
name.


Hi Jan, I think the issue with putting the lineage as part of the
representation is that that identifier only makes sense for that
representation's engine.  In your example, the catalog aliased as "iceberg"
in spark is going to have a different name in Dremio or Trino.

IMO, if we are to store a lineage for a view, it should consist of
something engine agnostic like the table/view UUIDs.  This would be stored
at the view version level and not the representation level.  I think as we
get into more of these multi-engine, multi-catalog use cases for views, the
Iceberg Catalog is going to need to do a better job at handling CRUD by
UUID instead of engine specific identifiers.  Another scenario we need to
think through is a view that joins tables from two different catalogs.  How
would we represent the lineage for that in an engine agnostic way?

Thanks
Benny



On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> Thanks Walaa and Benny for clarifying the problem. I think I have a better
> understanding now. Sorry for being a bit stubborn before.
>
> Wouldn't it make sense then to store the lineage as part of the
> representation:
>
> {
>
>     "type": "sql",
>
>     "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP
> BY 2",
>
>     "dialect": "spark",
>
>     "lineage": [{
>
>         "identifier": { "catalog": "iceberg", "namespace": "public",
> "table": "events"},
>
>         "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385"
>
>     }]
>
> }
>
> Best wishes,
>
> Jan
> On 09.09.24 11:59, Walaa Eldin Moustafa wrote:
>
> Benny, thank you so much for performing the experiment. Glad that using
> UUIDs as keys in the state map makes more sense now.
>
> For the issue with the view spec being restrictive, I agree and I have
> raised the concern on the view spec PR last year [1]. I think there is some
> area of improvement here. At the least, if it is restrictive, it should be
> explicitly stated. I will start a thread on how to approach the view spec.
> We may need to get more insight on the view spec before finalizing the MV
> spec, because view spec will determine if we should proceed with one
> lineage (with the implicitly assumed common catalog name), or with multiple
> lineages (one per engine or catalog name).
>
> [1] https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619
>
> Thanks,
> Walaa.
>
>
> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote:
>
>> Hi Walaa
>>
>> I did some testing with two different engines (Spark and Dremio) against
>> the same Nessie catalog and created the attached materialized view
>> metadata.json.  I see your point now about the SQL identifiers being
>> tightly coupled to the engines.  In the metadata JSON, spark refers to the
>> catalog as "SparkNessie", whereas Dremio refers to the catalog as
>> "LocalNessie".  So, this means that the fully qualified view and table
>> identifiers are engine specific and Dremio can't lookup a Spark identifier
>> and vice versa.
>>
>> *So, I think it does make sense now for the refresh-state to key off the
>> UUIDs and not use engine specific identifiers.  *This also means that
>> the materization consumer will have to fully expand the query tree and
>> basically diff the UUID + latest snapshot ids against the refresh state.
>> Would it ever make sense for the Iceberg Catalog to expose a bulk lookup
>> API by UUID?
>>
>> As a side note, it seems that for a materialized view to work with
>> multiple engines, the default-catalog and default-namespace can't be used
>> unless both engines use the same catalog name which seems pretty
>> restrictive to me.
>>
>> Thanks for the great discussions
>> Benny
>>
>>
>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Jan, we definitely can store SQL identifiers of multiple representations
>>> in Approach 1.
>>>
>>> The takeaway is that SQL identifiers are highly coupled with engines,
>>> just like views. It makes sense to track both together for consistency.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> <jank...@mailbox.org.invalid> wrote:
>>>
>>>> Walaa, thanks you for bringing up this use case. I think we need to
>>>> keep in mind that we require identifiers to interface with the catalog. We
>>>> cannot use UUIDs.
>>>>
>>>> Which means you also wouldn't be able to use Approach 1 for your use
>>>> case because you can't store the catalog names of multiple representations
>>>> in the lineage. You would need to fallback to parsing the SQL for a
>>>> particular representation and rebuilding the full query tree to obtain the
>>>> identifiers.
>>>>
>>>> You could do the same for Approach 2. So I don't see why Approach 1
>>>> would yield any benefits.
>>>>
>>>> Thanks,
>>>>
>>>> Jan
>>>> On 07.09.24 00:01, Steven Wu wrote:
>>>>
>>>> Benny, `default-catalog` is optional, while `default-namespace` is
>>>> required.
>>>>
>>>> I will retract my comment on the `summary`. it indicates the engine
>>>> that made the revision to the current view version. it doesn't really
>>>> matter for multi-engine/representation support.
>>>>
>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <btc...@gmail.com> wrote:
>>>>
>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it
>>>>> wouldn't have to be under a specific representation.
>>>>> Walaa - That's a serious concern...  If the same catalog is aliased
>>>>> differently by two different engines, then the basic view spec seems 
>>>>> broken
>>>>> to me since "default-namespace" includes the catalog alias and is outside
>>>>> of the SQL representation.  Does that mean for a view to be interoperable,
>>>>> we require different engines to use the same catalog name?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>>>> representations (for different engines), which definitely requires more
>>>>>> discussion from the community.
>>>>>>
>>>>>> When I am looking at the view spec, I am seeing some conflict.
>>>>>> "summary" field seems meant for only one engine, while "representations"
>>>>>> support multiple engines.
>>>>>>
>>>>>> "summary" : {
>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>>>> "engine-name" : "Spark",
>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>>>> "engineVersion" : "3.3.2"
>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>>>> "representations" : [ {
>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type" :
>>>>>> "sql", <https://iceberg.apache.org/view-spec/#__codelineno-5-21>
>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY
>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22>
>>>>>> "dialect" : "spark"
>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>>>
>>>>>> With multiple representations/engines, I guess one engine will be
>>>>>> responsible for the storage table refresh and other engines are read 
>>>>>> only.
>>>>>> If we want to store the lineage info in the view, it probably needs to be
>>>>>> part of the "representation" struct so that each engine/representation
>>>>>> stores its own lineage info..
>>>>>> Who is to validate/ensure that the SQL representation is actually
>>>>>> semantically identical (minus syntax differences across engines)? I guess
>>>>>> this responsibility is left to the user who owns and manages the view.
>>>>>>
>>>>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to