Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Sun, 08 Sep 2024 17:29:16 -0700

Hi Walaa

I did some testing with two different engines (Spark and Dremio) against
the same Nessie catalog and created the attached materialized view
metadata.json.  I see your point now about the SQL identifiers being
tightly coupled to the engines.  In the metadata JSON, spark refers to the
catalog as "SparkNessie", whereas Dremio refers to the catalog as
"LocalNessie".  So, this means that the fully qualified view and table
identifiers are engine specific and Dremio can't lookup a Spark identifier
and vice versa.


*So, I think it does make sense now for the refresh-state to key off the
UUIDs and not use engine specific identifiers.  *This also means that the
materization consumer will have to fully expand the query tree and
basically diff the UUID + latest snapshot ids against the refresh state.
Would it ever make sense for the Iceberg Catalog to expose a bulk lookup
API by UUID?

As a side note, it seems that for a materialized view to work with multiple
engines, the default-catalog and default-namespace can't be used unless
both engines use the same catalog name which seems pretty restrictive to me.

Thanks for the great discussions
Benny


On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa <[email protected]>
wrote:

> Jan, we definitely can store SQL identifiers of multiple representations
> in Approach 1.
>
> The takeaway is that SQL identifiers are highly coupled with engines, just
> like views. It makes sense to track both together for consistency.
>
> Thanks,
> Walaa.
>
> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul <[email protected]>
> wrote:
>
>> Walaa, thanks you for bringing up this use case. I think we need to keep
>> in mind that we require identifiers to interface with the catalog. We
>> cannot use UUIDs.
>>
>> Which means you also wouldn't be able to use Approach 1 for your use case
>> because you can't store the catalog names of multiple representations in
>> the lineage. You would need to fallback to parsing the SQL for a particular
>> representation and rebuilding the full query tree to obtain the
>> identifiers.
>>
>> You could do the same for Approach 2. So I don't see why Approach 1 would
>> yield any benefits.
>>
>> Thanks,
>>
>> Jan
>> On 07.09.24 00:01, Steven Wu wrote:
>>
>> Benny, `default-catalog` is optional, while `default-namespace` is
>> required.
>>
>> I will retract my comment on the `summary`. it indicates the engine that
>> made the revision to the current view version. it doesn't really matter for
>> multi-engine/representation support.
>>
>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <[email protected]> wrote:
>>
>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it wouldn't
>>> have to be under a specific representation.
>>> Walaa - That's a serious concern...  If the same catalog is aliased
>>> differently by two different engines, then the basic view spec seems broken
>>> to me since "default-namespace" includes the catalog alias and is outside
>>> of the SQL representation.  Does that mean for a view to be interoperable,
>>> we require different engines to use the same catalog name?
>>>
>>> Thanks
>>>
>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Walaa, thanks for bringing up the interesting case of multiple
>>>> representations (for different engines), which definitely requires more
>>>> discussion from the community.
>>>>
>>>> When I am looking at the view spec, I am seeing some conflict.
>>>> "summary" field seems meant for only one engine, while "representations"
>>>> support multiple engines.
>>>>
>>>> "summary" : { <https://iceberg.apache.org/view-spec/#__codelineno-5-16>
>>>> "engine-name" : "Spark",
>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17>
>>>> "engineVersion" : "3.3.2"
>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> },
>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19>
>>>> "representations" : [ {
>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type" :
>>>> "sql", <https://iceberg.apache.org/view-spec/#__codelineno-5-21> "sql"
>>>> : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",
>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-22> "dialect" :
>>>> "spark" <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ]
>>>>
>>>> With multiple representations/engines, I guess one engine will be
>>>> responsible for the storage table refresh and other engines are read only.
>>>> If we want to store the lineage info in the view, it probably needs to be
>>>> part of the "representation" struct so that each engine/representation
>>>> stores its own lineage info..
>>>> Who is to validate/ensure that the SQL representation is actually
>>>> semantically identical (minus syntax differences across engines)? I guess
>>>> this responsibility is left to the user who owns and manages the view.
>>>>
>>>>

00000-ad8108dc-9fd1-4af6-8549-48964a40ca60.metadata.json
Description: application/json

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to