Hello All,
I have answered all the comments and revised the spec to reflect them.
The doc now shows the new revision (link here:
https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
).
Please diff with the version named 'Spec v1' in order to see the deltas.
Please let me know if any of your comments are not
satisfactorily addressed.

@Ryan, Thank you for the insightful feedback, especially around the use of
Iceberg data types for schema and possible evolution.
I agree with your comment: *"I think we should move forward with the
proposal as it is, and pursue type annotations and possibly IR in
parallel."*
How do we achieve consensus/farm for dissent and move forward with the
proposal?

thanks,
Anjali.



On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jzh...@apache.org> wrote:

> +1
>
> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks for working on this, Anjali! It’s great to see the thorough
>> discussion here with everyone.
>>
>> For the discussion about SQL dialect, I think that the right first step
>> is to capture the SQL or query dialect. That will give us the most
>> flexibility. Engines that can use Coral for translation can attempt to
>> convert and engines that don’t can see if the SQL is valid and can be used.
>>
>> I think that the idea to create a minimal IR is an interesting one, but
>> can be added later. We will always need to record the SQL and dialect, even
>> if we translate to IR because users are going to configure views using SQL.
>> Uses like showing view history or debugging need to show the original SQL,
>> plus relevant information like where it was created and the SQL dialect. We
>> should be able to add this later by adding additional metadata to the view
>> definition. I don’t think that it would introduce breaking changes to add a
>> common representation that can be optionally consumed.
>>
>> Let’s continue talking about a minimal IR, separately. View translation
>> is a hard problem. Right now, to get views across engines we have to
>> compromise confidence. IR is a way to have strong confidence, but with
>> limited expressibility. I think that’s a good trade in a lot of cases and
>> is worth pursuing, even if it will take a long time.
>>
>> Jacques makes a great point about types, but I think that the right
>> option here is to continue using Iceberg types. We’ve already had
>> discussions about whether Iceberg should support annotating types with
>> engine-specific ones, so we have a reasonable way to improve this while
>> also providing compatibility across engines: char(n) is not necessarily
>> supported everywhere and mapping it to string will make sense in most
>> places. The schema is primarily used to validate that the data produced by
>> the query hasn’t changed and that is more about the number of columns in
>> structs and the names of fields rather than exact types. We can fix up
>> types when substituting without losing too much: if the SQL produces a
>> varchar(10) field that the view metadata says is a string, then it’s
>> okay that it is varchar(10). There is some loss in that we don’t know if
>> it was originally varchar(5), but I think that this is not going to
>> cause too many issues. Not all engines will even validate that the schema
>> has not changed, since it could be valid to use select * from x where ...
>> and allow new fields to appear.
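To make the fix-up idea above concrete, here is a minimal sketch (in Python, with entirely hypothetical helper names; this is not Iceberg's actual validation API) of a structural schema check that compares field count and names and coerces parameterized character types to string:

```python
# Hypothetical sketch: validate a view's stored schema against the schema
# the SQL actually produces, by structure rather than exact types.
# char(n)/varchar(n) are treated as plain string, per the discussion above.

def normalize(type_name):
    # Map parameterized character types to string; leave other types alone
    if type_name.startswith(("char(", "varchar(")):
        return "string"
    return type_name

def schemas_compatible(stored, produced):
    # stored/produced: lists of (field_name, type_name) pairs.
    # Require the same field count and names; compare normalized types.
    if len(stored) != len(produced):
        return False
    return all(
        s_name == p_name and normalize(s_type) == normalize(p_type)
        for (s_name, s_type), (p_name, p_type) in zip(stored, produced)
    )
```

Under this check, a query producing `varchar(10)` still matches view metadata that says `string`, while a dropped or renamed column is flagged.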
>>
>> Right now, I think we should move forward with the proposal as it is, and
>> pursue type annotations and possibly IR in parallel. Does that sound
>> reasonable to everyone?
>>
>> Ryan
>>
>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi Anjali,
>>>
>>> That's a nice summary.
>>>
>>> re dialect field: it shouldn't be much trouble to have it (or any other
>>> way to identify the application that created the view), and it might be
>>> useful. Why not make it required from the start?
>>>
>>> re "expanded/resolved SQL" -- I don't understand yet what we would put
>>> there, so I cannot comment.
>>>
>>> I agree it's nice to get something out of the door, and I see how
>>> the current proposal fits some needs already.
>>> However, I am concerned about the proliferation of views that are not
>>> cross-engine compatible, if we do that.
>>>
>>> Also, if we later agree on any compatible approach (portable subset of
>>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
>>> it would be a breaking change.
>>> Unless we make the compatible approach as expressive as the full power of
>>> SQL, some views that are possible to create in v1 will not be possible to
>>> create in v2.
>>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible",
>>> we may not be able to roll it out.
>>>
>>> > the convention of common SQL has been working for a majority of users.
>>> SQL features commonly used are column projections, simple filter
>>> application, joins, grouping and common aggregate and scalar function. A
>>> few users occasionally would like to use Trino or Spark specific functions
>>> but are sometimes able to find a way to use a function that is common to
>>> both the engines.
>>>
>>>
>>> It's an awesome summary of what constructs are necessary to be able to
>>> define useful views, while also keeping them portable.
>>>
>>> To be able to express column projections, simple filter application,
>>> joins, grouping and common aggregate and scalar function in a structured
>>> IR, how much effort do you think would be required?
>>> We didn't really talk about the downsides of a structured approach, other
>>> than that it looks complex.
>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for
>>> that. Maybe I was overly optimistic, though.
>>>
>>>
>>> As Jack mentioned, for an engine-specific approach that's not supposed to
>>> be consumed by multiple engines, we may be better served with an approach
>>> that's outside of the Iceberg spec, like
>>> https://github.com/trinodb/trino/pull/8540.
>>>
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>> <anorw...@netflix.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you for all the comments. I will try to address them all here
>>>> together.
>>>>
>>>>
>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>    options such as engine agnostic SQL or IR of some form have been 
>>>> mentioned.
>>>>    We can all agree that all of these options are non-trivial to
>>>>    design/implement (perhaps a multi-year effort based on the option 
>>>> chosen)
>>>>    and merit further discussion. I would like to suggest that we continue 
>>>> this
>>>>    discussion but target this work for the future (v2?). In v1, we can add 
>>>> an
>>>>    optional dialect field and an optional expanded/resolved SQL field that 
>>>> can
>>>>    be interpreted by engines as they see fit. V1 can unlock many use cases
>>>>    where the views are either accessed by a single engine or multi-engine 
>>>> use
>>>>    cases where a (common) subset of SQL is supported. This proposal allows 
>>>> for
>>>>    desirable features such as versioning of views and a common format of
>>>>    storing view metadata while allowing extensibility in the future. *Does
>>>>    anyone feel strongly otherwise?*
>>>>    - @Piotr  As for common views at Netflix, the restrictions on SQL
>>>>    are not enforced, but are advised as best practices. The convention of
>>>>    common SQL has been working for a majority of users. SQL features 
>>>> commonly
>>>>    used are column projections, simple filter application, joins, grouping 
>>>> and
>>>>    common aggregate and scalar function. A few users occasionally would 
>>>> like
>>>>    to use Trino or Spark specific functions but are sometimes able to find 
>>>> a
>>>>    way to use a function that is common to both the engines.
>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>    hence were picked for storing view schema. Thinking further, the schema
>>>>    field should be made 'optional', since not all engines require it. (e.g.
>>>>    Spark does not need it and Trino uses it only for validation).
>>>>    - @Jacques Table references in the views can be arbitrary objects
>>>>    such as tables from other catalogs or elasticsearch tables etc. I will
>>>>    clarify it in the spec.
>>>>
>>>> I will work on incorporating all the comments in the spec and make the
>>>> next revision available for review soon.
>>>>
>>>> Regards,
>>>> Anjali.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>> pi...@starburstdata.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>
>>>>> I agree that tracking dialect/origin is better than nothing.
>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>> That is, it would only be useful if there were some external app (or a
>>>>> human being) that would write those alternative SQLs for each dialect.
>>>>> Otherwise, I cannot imagine Spark writing SQL for Spark and Trino, or
>>>>> Trino writing SQL for Trino and Spark.
>>>>>
>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>> While I like the idea of starting with the Trino SQL ANTLR grammar file
>>>>> (it's really well written and resembles the spec quite well), you made a
>>>>> good point that grammar is only part of the problem. Coercions, function
>>>>> resolution, dereference resolution, and table resolution are parts of
>>>>> query analysis that go beyond just grammar.
>>>>> In fact, column scoping rules -- while clearly defined by the spec --
>>>>> may easily differ between engines (pretty usual).
>>>>> That's why I would rather lean towards some intermediate
>>>>> representation that is *not* SQL, doesn't require parsing (it is already
>>>>> structural), nor analysis (no scopes! no implicit coercions!).
>>>>> Before we embark on such a journey, it would be interesting to hear 
>>>>> @Martin
>>>>> Traverso <mar...@starburstdata.com> 's thoughts on feasibility though.
>>>>>
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>> jacquesnad...@gmail.com> wrote:
>>>>>
>>>>>> Some thoughts...
>>>>>>
>>>>>>    - In general, many engines want (or may require) a resolved sql
>>>>>>    field. This--at minimum--typically includes star expansion since
>>>>>>    traditional view behavior is stars are expanded at view creation time
>>>>>>    (since this is the only way to guarantee that the view returns the 
>>>>>> same
>>>>>>    logical definition even if the underlying table changes). This may 
>>>>>> also
>>>>>>    include a replacement of relative object names to absolute object 
>>>>>> names
>>>>>>    based on the session catalog & namespace. If I recall correctly, Hive 
>>>>>> does
>>>>>>    both of these things.
>>>>>>    - It isn't clear in the spec whether the table references used in
>>>>>>    views are restricted to other Iceberg objects or can be arbitrary 
>>>>>> objects
>>>>>>    in the context of a particular engine. Maybe I missed this? For 
>>>>>> example,
>>>>>>    can I have a Trino engine view that references an Elasticsearch table
>>>>>>    stored in an Iceberg view?
>>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>>    unintended consequences. I appreciate the attraction to it but I 
>>>>>> think it
>>>>>>    would either create artificial barriers around the types of SQL that 
>>>>>> are
>>>>>>    allowed and/or mean that replacing a CTE with a view could potentially
>>>>>>    change the behavior of the query which I believe violates most typical
>>>>>>    engine behaviors. A good example of this is the simple sql statement 
>>>>>> of
>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by
>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>>>>>    context, is this view disallowed? If it isn't disallowed then you 
>>>>>> have an
>>>>>>    issue where the view schema will be required to be different from a 
>>>>>> CTE
>>>>>>    since the engine will resolve it differently than Iceberg. Even if you
>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>    - It is important to remember that Calcite is a set of libraries
>>>>>>    and not a specification. There are things that can be specified in 
>>>>>> Calcite
>>>>>>    but in general it doesn't have formal specification as a first 
>>>>>> principle.
>>>>>>    It is more implementation as a first principle. This is in contrast to
>>>>>>    projects like Arrow and Iceberg, which start with well-formed
>>>>>>    specifications. I've been working with Calcite since before it was an
>>>>>>    Apache project and I wouldn't recommend adopting it as any form of a
>>>>>>    specification. On the flipside, I am very supportive of using it for a
>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the 
>>>>>> adoption of
>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark [1]
>>>>>>    and/or Trino [2] use. Even that is not a complete specification as 
>>>>>> grammar
>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>    resolution, consistent unnamed expression naming, etc that aren't 
>>>>>> defined
>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC 
>>>>>> grammar
>>>>>>    as it heavily embeds implementation details (in a good way) and 
>>>>>> relies on
>>>>>>    some fairly complex logic in the validator and sql2rel components to 
>>>>>> be
>>>>>>    fully resolved/comprehended.
>>>>>>
>>>>>> Given the above, I suggest having a field which describes the
>>>>>> dialect (origin?) of the view, and then each engine can decide how they 
>>>>>> want
>>>>>> to consume/mutate that view (and whether they want to or not). It does 
>>>>>> risk
>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>> establishing a formal SQL specification to be a similarly long process to
>>>>>> the couple of years it took to build the Arrow and Iceberg 
>>>>>> specifications.
>>>>>> (Realistically, there is far more to specify here than there is in either
>>>>>> of those two domains.)
>>>>>>
>>>>>> Some other notes:
>>>>>>
>>>>>>    - Calcite does provide a nice reference document [3] but it is
>>>>>>    not sufficient to implement what is necessary for
>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>    - Projects like Coral [4] are interesting here but even Coral is
>>>>>>    based roughly on "HiveQL" which also doesn't have a formal 
>>>>>> specification
>>>>>>    process outside of the Hive version you're running. See this thread in
>>>>>>    Coral slack [5]
>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>    closer to specification based [7] than Calcite but is much less 
>>>>>> popular in
>>>>>>    the big data domain. I also haven't reviewed its SQL completeness
>>>>>> closely, a strength of Calcite.
>>>>>>    - One of the other problems with building against an
>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it can 
>>>>>> make
>>>>>>    it difficult or near impossible to implement the same algorithms again
>>>>>>    without a bunch of reverse engineering. If interested in an example of
>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet spec 
>>>>>> [8] for
>>>>>>    how painful this kind of mistake can become.
>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>    actually implements that in its entirety and it has far too many 
>>>>>> places
>>>>>>    where things are "implementation-defined" [9].
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>> [2]
>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>> [4] https://github.com/linkedin/coral
>>>>>> [5]
>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>> [6] https://github.com/google/zetasql
>>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>> [8]
>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>
>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>
>>>>>>> I would personally prefer cross-engine interoperability as a goal
>>>>>>> and get the spec in the right structure in the initial release, because:
>>>>>>>
>>>>>>> 1. I believe that cross-engine compatibility is a critical feature
>>>>>>> of Iceberg. If I am a user of an existing data lake that already 
>>>>>>> supports
>>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. 
>>>>>>> I
>>>>>>> can do what is now done for Trino to use views with Iceberg. I can also
>>>>>>> just use a table property to indicate the table is a view and store the
>>>>>>> view SQL as a table property and do my own thing in any query engine to
>>>>>>> support all the view features. One of the most valuable and unique 
>>>>>>> features
>>>>>>> that Iceberg view can unlock is to allow a view to be created in one 
>>>>>>> engine
>>>>>>> and read by another. Not supporting cross-engine compatibility feels 
>>>>>>> like
>>>>>>> losing a lot of value to me.
>>>>>>>
>>>>>>> 2. In the view definition, it feels inconsistent to me that we have
>>>>>>> "schema" as an Iceberg native schema, but "sql" field as the view SQL 
>>>>>>> that
>>>>>>> can come from any query engine. If the engine already needs to convert 
>>>>>>> the
>>>>>>> view schema to an Iceberg schema, it should just do the same for the
>>>>>>> view SQL.
>>>>>>>
>>>>>>> Regarding the way to achieve it, I think it comes to either Apache
>>>>>>> Calcite (or some other third party alternative I don't know) or our own
>>>>>>> implementation of some intermediate representation. I don't have a very
>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>
>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>> barrier for newer engines to onboard because it not only needs to 
>>>>>>> implement
>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to 
>>>>>>> become a
>>>>>>> constant discussion around what we maintain and what we should push to
>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>
>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to
>>>>>>> have a native lightweight logical query structure representation of the
>>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>>> Expressions in Iceberg, and engines have to convert predicates to 
>>>>>>> Iceberg
>>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>>> extend on that to support this use case. Different engines can build 
>>>>>>> this
>>>>>>> logical structure when traversing their own AST during a create view 
>>>>>>> query.
>>>>>>>
>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the 
>>>>>>> engine
>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. In
>>>>>>> this way, the engine that creates the view can still read the SQL 
>>>>>>> directly
>>>>>>> which might lead to better engine-native integration and avoid redundant
>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>> representation it can fallback to when the engine's key is not found in 
>>>>>>> the
>>>>>>> map. If we want to make incremental progress and delay the design for 
>>>>>>> the
>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>> instead of just a single string.
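For illustration only, such a map-based "sql" field might look like this in the view metadata JSON (field names here are hypothetical, not part of the proposed spec):

```json
{
  "engine-sqls": {
    "spark-3.1": "SELECT id, count(1) AS cnt FROM db.events GROUP BY id",
    "trino-359": "SELECT id, count(1) AS cnt FROM db.events GROUP BY id"
  },
  "default-representation": "placeholder for a future intermediate form"
}
```

The creating engine reads its own entry directly; any other engine falls back to the default representation once one is defined.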
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>> pi...@starburstdata.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> First of all thank you for this discussion and all the view-related
>>>>>>>> work!
>>>>>>>>
>>>>>>>> I agree that solving the cross-engine compatibility problem may not be
>>>>>>>> a primary feature today, but I am concerned that not thinking about
>>>>>>>> this from the start may "tunnel" us into a wrong direction.
>>>>>>>> Cross-engine compatible views would be such a cool feature that it
>>>>>>>> is hard to just let it pass.
>>>>>>>>
>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>> being familiar enough with Calcite.
>>>>>>>> However, a new IR being focused on a compatible representation,
>>>>>>>> and not being tied to anything, are actually good things.
>>>>>>>> For example, we need to focus on JSON representation, but we don't
>>>>>>>> need to deal with tree traversal or anything, so the code for this 
>>>>>>>> could be
>>>>>>>> pretty simple.
>>>>>>>>
>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>>> across engines in the view definition (this is how currently Netflix 
>>>>>>>> uses
>>>>>>>> these 'common' views across Spark and Trino)
>>>>>>>>
>>>>>>>> It's interesting. Anjali, do you have the means to enforce that, or is
>>>>>>>> this just a convention?
>>>>>>>>
>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>>> views?
>>>>>>>> Being able to enumerate them could help validate the various
>>>>>>>> approaches considered here, including the feasibility of a dedicated
>>>>>>>> representation.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Anjali,
>>>>>>>>>
>>>>>>>>> I am definitely happy to help with implementing 1-3 in your first
>>>>>>>>> list once the spec has been approved by the community. My hope is 
>>>>>>>>> that the
>>>>>>>>> final version of the view spec will make it easy to re-use existing
>>>>>>>>> rollback/time travel/metadata etc functionalities.
>>>>>>>>>
>>>>>>>>> Regarding SQL dialects: my personal opinion is that enforcing
>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>> desirable, while storing a Calcite AST makes it hard for e.g. Python
>>>>>>>>> to use views. A
>>>>>>>>> project to make a cross language and cross engine IR for sql views 
>>>>>>>>> and the
>>>>>>>>> relevant transpilers is imho outside the scope of this spec and 
>>>>>>>>> probably
>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr 
>>>>>>>>> suggested
>>>>>>>>> is possible but I feel it will likely quickly snowball into a larger
>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I 
>>>>>>>>> think the
>>>>>>>>> most reasonable way forward is to add a dialect field and a warning to
>>>>>>>>> engines that views are not (yet) cross-compatible. This is at odds
>>>>>>>>> with the original spirit of Iceberg tables, and I wonder how the
>>>>>>>>> broader community
>>>>>>>>> feels about it? I would hope that we can make the view spec 
>>>>>>>>> engine-free
>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have
>>>>>>>>>> and will continue to address them in the doc.
>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>
>>>>>>>>>> Thanks to everybody who has offered help with implementation. The
>>>>>>>>>> spec as it is proposed in the doc has been implemented and is in use 
>>>>>>>>>> at
>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will 
>>>>>>>>>> rebase
>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>> other feedback from the community and should be able to make this MVP
>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>
>>>>>>>>>> A few areas that we have not yet worked on and would love for the
>>>>>>>>>> community to help are:
>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>> version or time
>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>> similar to the $snapshots table in order to display the history of a 
>>>>>>>>>> view
>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a previous
>>>>>>>>>> version
>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>
>>>>>>>>>> One comment that is worth a broader discussion is the dialect of
>>>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to 
>>>>>>>>>> provide
>>>>>>>>>> a storage format for view metadata and APIs to access that metadata. 
>>>>>>>>>> The
>>>>>>>>>> dialect of the SQL stored is an orthogonal question and is outside 
>>>>>>>>>> the
>>>>>>>>>> scope of this spec.
>>>>>>>>>>
>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>> common across engines in the view definition (this is how currently 
>>>>>>>>>> Netflix
>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of
>>>>>>>>>> the SQL. This allows for any desired dialect, but no improved 
>>>>>>>>>> cross-engine
>>>>>>>>>> operability
>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>> 4. Intermediate structured language of our own. (What additional
>>>>>>>>>> functionality does it provide over Calcite?)
>>>>>>>>>>
>>>>>>>>>> Given that the view metadata is json, it is easily extendable to
>>>>>>>>>> incorporate any new fields needed to make the SQL truly compatible 
>>>>>>>>>> across
>>>>>>>>>> engines.
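As a sketch of option 2, a dialect field could look like this in the view metadata JSON (field names are illustrative, not taken from the spec):

```json
{
  "format-version": 1,
  "sql": "SELECT id, name FROM db.users WHERE active = true",
  "dialect": "spark",
  "schema": { "type": "struct", "fields": [] }
}
```

Engines that do not speak the recorded dialect can still inspect the SQL and decide whether it is valid for them, as discussed above.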
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>> Anjali
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>> pi...@starburstdata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>
>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>> between query engines.
>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>
>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>> I treat it as a great technology supporting transitioning from a
>>>>>>>>>>> query engine to a better one.
>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>
>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>>> This would eliminate the need for cross-engine-compatible parsing
>>>>>>>>>>> and analysis.
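One way to picture such a structured representation (a purely hypothetical sketch, not part of any proposal here) is a tiny tree of relational operations that engines can serialize to JSON without any parsing or analysis:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical mini-IR for views: already structural, so a consuming engine
# needs no SQL parser and no analyzer (no scopes, no implicit coercions).

@dataclass
class Scan:
    table: str        # fully-qualified table reference, names pre-resolved

@dataclass
class Filter:
    child: Scan
    predicate: list   # e.g. ["eq", "region", "US"], in prefix form

@dataclass
class Project:
    child: Filter
    columns: list     # explicit column list: stars pre-expanded at creation

# A view equivalent to:
#   SELECT event_id, ts FROM prod.db.events WHERE region = 'US'
view = Project(
    child=Filter(
        child=Scan(table="prod.db.events"),
        predicate=["eq", "region", "US"],
    ),
    columns=["event_id", "ts"],
)

# Serialize to JSON for storage in the view metadata
serialized = json.dumps(asdict(view))
```

Each engine would build this tree from its own AST at CREATE VIEW time and translate it back into its native plan when reading.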
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> PF
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>
>>>>>>>>>>>> I have left some comments on the document. I unfortunately have
>>>>>>>>>>>> to miss the community meetup tomorrow but would love to chat 
>>>>>>>>>>>> more/help w/
>>>>>>>>>>>> implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been 
>>>>>>>>>>>>> implemented [1] and
>>>>>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
> --
> John Zhuge
>
