Re: Proposal: Support for views in Iceberg

Ryan Murray Thu, 22 Jul 2021 05:28:32 -0700

Hey Anjali,

I am definitely happy to help with implementing 1-3 in your first list once
the spec has been approved by the community. My hope is that the final
version of the view spec will make it easy to re-use existing rollback/time
travel/metadata etc functionalities.


Regarding SQL dialects.My personal opinion is: Enforcing ANSI-compliant SQL
across all engines is hard and probably not desirable while storing Calcite
makes it hard for eg python to use views. A project to make a cross
language and cross engine IR for sql views and the relevant transpilers is
imho outside the scope of this spec and probably deserving an apache
project of its own. A smaller IR like Piotr suggested is possible but I
feel it will likely quickly snowball into a larger project and slow down
adoption of the view spec in iceberg. So I think the most reasonable way
forward is to add a dialect field and a warning to engines that views are
not (yet) cross compatible. This is at odds with the original spirit of
iceberg tables and I wonder how the border community feels about it? I
would hope that we can make the view spec engine-free over time and
eventually deprecate the dialect field.

Best,
Ryan

PS if anyone is interested in collaborating on engine agnostic views please
reach out. I am keen on exploring this topic.

On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
<anorw...@netflix.com.invalid> wrote:

> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have and will
> continue to address them in the doc.
> Great to know about Trino views, Piotr!
>
> Thanks to everybody who has offered help with implementation. The spec as
> it is proposed in the doc has been implemented and is in use at Netflix
> (currently on Iceberg 0.9). Once we close the spec, we will rebase our code
> to Iceberg-0.12 and incorporate changes to format and other feedback from
> the community and should be able to make this MVP implementation available
> quickly as a PR.
>
> A few areas that we have not yet worked on and would love for the
> community to help are:
> 1. Time travel on views: Be able to access the view as of a version or time
> 2. History table: A system table implementation for $versions similar to
> the $snapshots table in order to display the history of a view
> 3. Rollback to a version: A way to rollback a view to a previous version
> 4. Engine agnostic SQL: more below.
>
> One comment that is worth a broader discussion is the dialect of the SQL
> stored in the view metadata. The purpose of the spec is to provide a
> storage format for view metadata and APIs to access that metadata. The
> dialect of the SQL stored is an orthogonal question and is outside the
> scope of this spec.
>
> Nonetheless, it is an important concern, so compiling a few suggestions
> that came up in the comments to continue the discussion:
> 1. Allow only ANSI-compliant SQL and anything that is truly common across
> engines in the view definition (this is how currently Netflix uses these
> 'common' views across Spark and Trino)
> 2. Add a field to the view metadata to identify the dialect of the SQL.
> This allows for any desired dialect, but no improved cross-engine
> operability
> 3. Store AST produced by Calcite in the view metadata and translate back
> and forth between engine-supported SQL and AST
> 4. Intermediate structured language of our own. (What additional
> functionality does it provide over Calcite?)
>
> Given that the view metadata is json, it is easily extendable to
> incorporate any new fields needed to make the SQL truly compatible across
> engines.
>
> What do you think?
>
> regards,
> Anjali
>
>
>
> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> FWIW, in Trino we just added Trino views support.
>> https://github.com/trinodb/trino/pull/8540
>> Of course, this is by no means usable by other query engines.
>>
>> Anjali, your document does not talk much about compatibility between
>> query engines.
>> How do you plan to address that?
>>
>> For example, I am familiar with Coral, and I appreciate its powers for
>> dealing with legacy stuff like views defined by Hive.
>> I treat it as a great technology supporting transitioning from a query
>> engine to a better one.
>> However, I would not base a design of some new system for storing
>> cross-engine compatible views on that.
>>
>> Is there something else we can use?
>> Maybe the view definition should use some intermediate structured
>> language that's not SQL?
>> For example, it could represent logical structure of operations in
>> semantics manner.
>> This would eliminate need for cross-engine compatible parsing and
>> analysis.
>>
>> Best
>> PF
>>
>>
>>
>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Thanks Anjali!
>>>
>>> I have left some comments on the document. I unfortunately have to miss
>>> the community meetup tomorrow but would love to chat more/help w/
>>> implementation.
>>>
>>> Best,
>>> Ryan
>>>
>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>> <anorw...@netflix.com.invalid> wrote:
>>>
>>>> Hello,
>>>>
>>>> John Zhuge and I would like to propose the following spec for storing
>>>> view metadata in Iceberg. The proposal has been implemented [1] and is in
>>>> production at Netflix for over 15 months.
>>>>
>>>>
>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>
>>>> [1]
>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>
>>>> Please let us know your thoughts by adding comments to the doc.
>>>>
>>>> Thanks,
>>>> Anjali.
>>>>
>>>

Re: Proposal: Support for views in Iceberg

Reply via email to