Re: Proposal: Support for views in Iceberg

Piotr Findeisen Thu, 22 Jul 2021 06:35:19 -0700

Hi,

First of all thank you for this discussion and all the view-related work!


I agree that solving cross-engine compatibility problem may not be primary
feature today, I am concerned that not thinking about this from the start
may "tunnel" us into a wrong direction.
Cross-engine compatible views would be such a cool feature that it is hard
to just let it pass.

My thinking about a smaller IR may be a side-product of me not being
familiar enough with Calcite.
However, with new IR being focused on compatible representation, and not
being tied to anything are actually good things.
For example, we need to focus on JSON representation, but we don't need to
deal with tree traversal or anything, so the code for this could be pretty
simple.

>  Allow only ANSI-compliant SQL and anything that is truly common across
engines in the view definition (this is how currently Netflix uses these
'common' views across Spark and Trino)

it's interesting. Anjali, do  you have means to enforce that, or is this
just a convention?

What are the common building blocks (relational operations, constructs and
functions) that you found sufficient for expressing your views?
Being able to enumerate them could help validate various approaches
considered here, including feasibility of dedicated representation.


Best,
PF




On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com> wrote:

> Hey Anjali,
>
> I am definitely happy to help with implementing 1-3 in your first list
> once the spec has been approved by the community. My hope is that the final
> version of the view spec will make it easy to re-use existing rollback/time
> travel/metadata etc functionalities.
>
> Regarding SQL dialects.My personal opinion is: Enforcing ANSI-compliant
> SQL across all engines is hard and probably not desirable while storing
> Calcite makes it hard for eg python to use views. A project to make a cross
> language and cross engine IR for sql views and the relevant transpilers is
> imho outside the scope of this spec and probably deserving an apache
> project of its own. A smaller IR like Piotr suggested is possible but I
> feel it will likely quickly snowball into a larger project and slow down
> adoption of the view spec in iceberg. So I think the most reasonable way
> forward is to add a dialect field and a warning to engines that views are
> not (yet) cross compatible. This is at odds with the original spirit of
> iceberg tables and I wonder how the border community feels about it? I
> would hope that we can make the view spec engine-free over time and
> eventually deprecate the dialect field.
>
> Best,
> Ryan
>
> PS if anyone is interested in collaborating on engine agnostic views
> please reach out. I am keen on exploring this topic.
>
> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
> <anorw...@netflix.com.invalid> wrote:
>
>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have and
>> will continue to address them in the doc.
>> Great to know about Trino views, Piotr!
>>
>> Thanks to everybody who has offered help with implementation. The spec as
>> it is proposed in the doc has been implemented and is in use at Netflix
>> (currently on Iceberg 0.9). Once we close the spec, we will rebase our code
>> to Iceberg-0.12 and incorporate changes to format and other feedback from
>> the community and should be able to make this MVP implementation available
>> quickly as a PR.
>>
>> A few areas that we have not yet worked on and would love for the
>> community to help are:
>> 1. Time travel on views: Be able to access the view as of a version or
>> time
>> 2. History table: A system table implementation for $versions similar to
>> the $snapshots table in order to display the history of a view
>> 3. Rollback to a version: A way to rollback a view to a previous version
>> 4. Engine agnostic SQL: more below.
>>
>> One comment that is worth a broader discussion is the dialect of the SQL
>> stored in the view metadata. The purpose of the spec is to provide a
>> storage format for view metadata and APIs to access that metadata. The
>> dialect of the SQL stored is an orthogonal question and is outside the
>> scope of this spec.
>>
>> Nonetheless, it is an important concern, so compiling a few suggestions
>> that came up in the comments to continue the discussion:
>> 1. Allow only ANSI-compliant SQL and anything that is truly common across
>> engines in the view definition (this is how currently Netflix uses these
>> 'common' views across Spark and Trino)
>> 2. Add a field to the view metadata to identify the dialect of the SQL.
>> This allows for any desired dialect, but no improved cross-engine
>> operability
>> 3. Store AST produced by Calcite in the view metadata and translate back
>> and forth between engine-supported SQL and AST
>> 4. Intermediate structured language of our own. (What additional
>> functionality does it provide over Calcite?)
>>
>> Given that the view metadata is json, it is easily extendable to
>> incorporate any new fields needed to make the SQL truly compatible across
>> engines.
>>
>> What do you think?
>>
>> regards,
>> Anjali
>>
>>
>>
>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi,
>>>
>>> FWIW, in Trino we just added Trino views support.
>>> https://github.com/trinodb/trino/pull/8540
>>> Of course, this is by no means usable by other query engines.
>>>
>>> Anjali, your document does not talk much about compatibility between
>>> query engines.
>>> How do you plan to address that?
>>>
>>> For example, I am familiar with Coral, and I appreciate its powers for
>>> dealing with legacy stuff like views defined by Hive.
>>> I treat it as a great technology supporting transitioning from a query
>>> engine to a better one.
>>> However, I would not base a design of some new system for storing
>>> cross-engine compatible views on that.
>>>
>>> Is there something else we can use?
>>> Maybe the view definition should use some intermediate structured
>>> language that's not SQL?
>>> For example, it could represent logical structure of operations in
>>> semantics manner.
>>> This would eliminate need for cross-engine compatible parsing and
>>> analysis.
>>>
>>> Best
>>> PF
>>>
>>>
>>>
>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com> wrote:
>>>
>>>> Thanks Anjali!
>>>>
>>>> I have left some comments on the document. I unfortunately have to miss
>>>> the community meetup tomorrow but would love to chat more/help w/
>>>> implementation.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>> <anorw...@netflix.com.invalid> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> John Zhuge and I would like to propose the following spec for storing
>>>>> view metadata in Iceberg. The proposal has been implemented [1] and is in
>>>>> production at Netflix for over 15 months.
>>>>>
>>>>>
>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>
>>>>> [1]
>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>
>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>
>>>>> Thanks,
>>>>> Anjali.
>>>>>
>>>>

Re: Proposal: Support for views in Iceberg

Reply via email to