+1

On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <b...@tabular.io> wrote:

> Thanks for working on this, Anjali! It’s great to see the thorough
> discussion here with everyone.
>
> For the discussion about SQL dialect, I think that the right first step is
> to capture the SQL or query dialect. That will give us the most
> flexibility. Engines that can use Coral for translation can attempt to
> convert, and engines that don't can check whether the SQL is valid and can
> be used.
>
> I think that the idea to create a minimal IR is an interesting one, but
> can be added later. We will always need to record the SQL and dialect, even
> if we translate to IR because users are going to configure views using SQL.
> Uses like showing view history or debugging need to show the original SQL,
> plus relevant information like where it was created and the SQL dialect. We
> should be able to add this later by adding additional metadata to the view
> definition. I don’t think that it would introduce breaking changes to add a
> common representation that can be optionally consumed.
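>
> As a purely hypothetical sketch (names are illustrative, not part of the
> proposal), a view version could later carry multiple optional
> representations alongside the original SQL:
>
>     "representations": [
>       { "type": "sql", "sql": "SELECT * FROM db.t", "dialect": "spark" },
>       { "type": "ir-v1", "plan": { ... } }
>     ]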
>
> Let’s continue talking about a minimal IR, separately. View translation is
> a hard problem. Right now, to get views across engines we have to
> compromise confidence. IR is a way to have strong confidence, but with
> limited expressibility. I think that’s a good trade in a lot of cases and
> is worth pursuing, even if it will take a long time.
>
> Jacques makes a great point about types, but I think that the right option
> here is to continue using Iceberg types. We’ve already had discussions
> about whether Iceberg should support annotating types with engine-specific
> ones, so we have a reasonable way to improve this while also providing
> compatibility across engines: char(n) is not necessarily supported
> everywhere and mapping it to string will make sense in most places. The
> schema is primarily used to validate that the data produced by the query
> hasn’t changed and that is more about the number of columns in structs and
> the names of fields rather than exact types. We can fix up types when
> substituting without losing too much: if the SQL produces a varchar(10)
> field that the view metadata says is a string, then it’s okay that it is
> varchar(10). There is some loss in that we don’t know if it was
> originally varchar(5), but I think that this is not going to cause too
> many issues. Not all engines will even validate that the schema has not
> changed, since it could be valid to use select * from x where ... and
> allow new fields to appear.
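>
> To illustrate with a hedged example (the schema JSON shape follows the
> Iceberg table spec, but the example itself is made up):
>
>     "schema": {
>       "type": "struct",
>       "fields": [
>         { "id": 1, "name": "c1", "required": false, "type": "string" }
>       ]
>     }
>
> If an engine resolves the view SQL and produces c1 as varchar(10),
> substituting it where the metadata says string is acceptable; the field
> count and names are what validation really needs.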
>
> Right now, I think we should move forward with the proposal as it is, and
> pursue type annotations and possibly IR in parallel. Does that sound
> reasonable to everyone?
>
> Ryan
>
> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi Anjali,
>>
>> That's a nice summary.
>>
>> Re: the dialect field. It shouldn't be much trouble to have it (or any
>> other way to identify the application that created the view), and it might
>> be useful. Why not make it required from the start?
>>
>> re "expanded/resolved SQL" -- i don't understand yet what we would put
>> there, so cannot comment.
>>
>> I agree that it's nice to get something out the door, and I see how the
>> current proposal fits some needs already.
>> However, I am concerned about the proliferation of views that are not
>> cross-engine compatible if we do that.
>>
>> Also, if we later agree on any compatible approach (portable subset of
>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine,
>> it would be a breaking change.
>> Unless we make the compatible approach as expressive as the full power of
>> SQL, some views that are possible to create in v1 will not be possible to
>> create in v2.
>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", we
>> may not be able to roll it out.
>>
>> > the convention of common SQL has been working for a majority of users.
>> SQL features commonly used are column projections, simple filter
>> application, joins, grouping, and common aggregate and scalar functions. A
>> few users occasionally would like to use Trino- or Spark-specific functions
>> but are sometimes able to find a way to use a function that is common to
>> both engines.
>>
>>
>> It's an awesome summary of the constructs that are necessary to define
>> useful views while also keeping them portable.
>>
>> To be able to express column projections, simple filter application,
>> joins, grouping, and common aggregate and scalar functions in a structured
>> IR, how much effort do you think would be required?
>> We didn't really talk about the downsides of a structured approach, other
>> than that it looks complex.
>> If we indeed estimate it as a multi-year effort, I wouldn't argue for it.
>> Maybe I was being overly optimistic, though.
>>
>>
>> As Jack mentioned, for an engine-specific approach that's not supposed to
>> be consumed by multiple engines, we may be better served by an approach
>> that's outside of the Iceberg spec, like
>> https://github.com/trinodb/trino/pull/8540.
>>
>>
>> Best,
>> PF
>>
>>
>>
>>
>>
>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>> <anorw...@netflix.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> Thank you for all the comments. I will try to address them all here
>>> together.
>>>
>>>
>>>    - @all Cross-engine compatibility of view definitions: Multiple
>>>    options, such as engine-agnostic SQL or an IR of some form, have been
>>>    mentioned. We can all agree that all of these options are non-trivial
>>>    to design/implement (perhaps a multi-year effort based on the option
>>>    chosen) and merit further discussion. I would like to suggest that we
>>>    continue this discussion but target this work for the future (v2?).
>>>    In v1, we can add an optional dialect field and an optional
>>>    expanded/resolved SQL field that can be interpreted by engines as
>>>    they see fit (see the sketch after this list). V1 can unlock many use
>>>    cases where the views are accessed by a single engine, as well as
>>>    multi-engine use cases where a (common) subset of SQL is supported.
>>>    This proposal allows for desirable features such as versioning of
>>>    views and a common format for storing view metadata while allowing
>>>    extensibility in the future. *Does anyone feel strongly otherwise?*
>>>    - @Piotr As for common views at Netflix, the restrictions on SQL are
>>>    not enforced, but are advised as best practices. The convention of
>>>    common SQL has been working for a majority of users. SQL features
>>>    commonly used are column projections, simple filter application,
>>>    joins, grouping, and common aggregate and scalar functions. A few
>>>    users occasionally would like to use Trino- or Spark-specific
>>>    functions but are sometimes able to find a way to use a function that
>>>    is common to both engines.
>>>    - @Jacques and @Jack Iceberg data types are engine-agnostic and hence
>>>    were picked for storing the view schema. Thinking further, the schema
>>>    field should be made optional, since not all engines require it
>>>    (e.g. Spark does not need it and Trino uses it only for validation).
>>>    - @Jacques Table references in the views can be arbitrary objects,
>>>    such as tables from other catalogs or Elasticsearch tables, etc. I
>>>    will clarify this in the spec.
>>>
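>>> To make the v1 proposal concrete, here is a hypothetical sketch of the
>>> optional fields (names are illustrative, not final):
>>>
>>>     {
>>>       "sql": "SELECT * FROM base_tab WHERE region = 'US'",
>>>       "dialect": "spark",
>>>       "expanded-sql": "SELECT id, region FROM prod.db.base_tab WHERE region = 'US'",
>>>       "schema": { ... }
>>>     }
>>>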
>>> I will work on incorporating all the comments in the spec and make the
>>> next revision available for review soon.
>>>
>>> Regards,
>>> Anjali.
>>>
>>>
>>>
>>>
>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>
>>>> I agree that tracking the dialect/origin is better than nothing.
>>>> I think having a Map {dialect: sql} is not going to buy us much, i.e. it
>>>> would be useful only if there were some external app (or a human being)
>>>> writing those alternative SQLs for each dialect.
>>>> Otherwise, I am not imagining Spark writing SQL for both Spark and Trino,
>>>> or Trino writing SQL for both Trino and Spark.
>>>>
>>>> Thanks Jacques for a good summary of the options for supporting SQL.
>>>> While I like the idea of starting with the Trino SQL ANTLR grammar file
>>>> (it's really well written and resembles the spec quite well), you made a
>>>> good point that grammar is only part of the problem. Coercions, function
>>>> resolution, dereference resolution, and table resolution are parts of
>>>> query analysis that go beyond just grammar.
>>>> In fact, column scoping rules -- while clearly defined by the spec --
>>>> may easily differ between engines (pretty usual).
>>>> That's why I would rather lean towards some intermediate representation
>>>> that is *not* SQL and requires neither parsing (it is already
>>>> structural) nor analysis (no scopes! no implicit coercions!).
>>>> Before we embark on such a journey, it would be interesting to hear
>>>> @Martin Traverso <mar...@starburstdata.com>'s thoughts on feasibility,
>>>> though.
>>>>
>>>>
>>>> Best,
>>>> PF
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <jacquesnad...@gmail.com>
>>>> wrote:
>>>>
>>>>> Some thoughts...
>>>>>
>>>>>    - In general, many engines want (or may require) a resolved SQL
>>>>>    field (see the sketch after this list). This--at minimum--typically
>>>>>    includes star expansion, since traditional view behavior is that
>>>>>    stars are expanded at view creation time (this is the only way to
>>>>>    guarantee that the view returns the same logical definition even if
>>>>>    the underlying table changes). It may also include a replacement of
>>>>>    relative object names with absolute object names based on the
>>>>>    session catalog & namespace. If I recall correctly, Hive does both
>>>>>    of these things.
>>>>>    - It isn't clear in the spec whether the table references used in
>>>>>    views are restricted to other Iceberg objects or can be arbitrary
>>>>>    objects in the context of a particular engine. Maybe I missed this?
>>>>>    For example, can I have a Trino engine view that references an
>>>>>    Elasticsearch table stored in an Iceberg view?
>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>    unintended consequences. I appreciate the attraction of it, but I
>>>>>    think it would either create artificial barriers around the types of
>>>>>    SQL that are allowed and/or mean that replacing a CTE with a view
>>>>>    could potentially change the behavior of the query, which I believe
>>>>>    violates most typical engine behaviors. A good example of this is
>>>>>    the simple SQL statement "SELECT c1, 'foo' as c2 from table1". In
>>>>>    many engines (and Calcite by default, I believe), c2 will be
>>>>>    specified as a CHAR(3). In this Iceberg context, is this view
>>>>>    disallowed? If it isn't disallowed, then you have an issue where the
>>>>>    view schema will be required to be different from a CTE's, since the
>>>>>    engine will resolve it differently than Iceberg. Even if you ignore
>>>>>    CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>    - It is important to remember that Calcite is a set of libraries and
>>>>>    not a specification. There are things that can be specified in
>>>>>    Calcite, but in general it doesn't have a formal specification as a
>>>>>    first principle; it is more implementation-first. This is in
>>>>>    contrast to projects like Arrow and Iceberg, which start with
>>>>>    well-formed specifications. I've been working with Calcite since
>>>>>    before it was an Apache project and I wouldn't recommend adopting it
>>>>>    as any form of a specification. On the flip side, I am very
>>>>>    supportive of using it as a reference implementation standard for
>>>>>    Iceberg view consumption, manipulation, etc. If anything, I'd
>>>>>    suggest we start with the adoption of a relatively clear grammar,
>>>>>    e.g. the ANTLR grammar file that Spark [1] and/or Trino [2] use.
>>>>>    Even that is not a complete specification, as the grammar must still
>>>>>    be interpreted with regard to type promotion, function resolution,
>>>>>    consistent unnamed expression naming, etc., which aren't defined at
>>>>>    the grammar level. I'd definitely avoid using Calcite's JavaCC
>>>>>    grammar, as it heavily embeds implementation details (in a good way)
>>>>>    and relies on some fairly complex logic in the validator and sql2rel
>>>>>    components to be fully resolved/comprehended.
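>>>>>
>>>>> To make the resolved-SQL point concrete, a hypothetical sketch of view
>>>>> metadata (field names are illustrative only):
>>>>>
>>>>>     {
>>>>>       "sql": "SELECT * FROM events",
>>>>>       "resolved-sql": "SELECT id, ts, payload FROM prod.db.events"
>>>>>     }
>>>>>
>>>>> Here the star is expanded and the table name is made absolute, so the
>>>>> view keeps its logical definition even if prod.db.events later gains
>>>>> columns.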
>>>>>
>>>>> Given the above, I suggest having a field that describes the dialect
>>>>> (origin?) of the view; each engine can then decide how it wants to
>>>>> consume/mutate that view (and whether it wants to at all). It does risk
>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>> establishing a formal SQL specification to be a similarly long process
>>>>> to the couple of years it took to build the Arrow and Iceberg
>>>>> specifications. (Realistically, there is far more to specify here than
>>>>> there is in either of those two domains.)
>>>>>
>>>>> Some other notes:
>>>>>
>>>>>    - Calcite does provide a nice reference document [3], but it is not
>>>>>    sufficient for implementing what is necessary to
>>>>>    parse/validate/resolve a SQL string correctly and consistently.
>>>>>    - Projects like Coral [4] are interesting here, but even Coral is
>>>>>    based roughly on "HiveQL", which also doesn't have a formal
>>>>>    specification process outside of the Hive version you're running.
>>>>>    See this thread in the Coral Slack [5].
>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels closer
>>>>>    to specification-based [7] than Calcite but is much less popular in
>>>>>    the big data domain. I also haven't reviewed its SQL completeness
>>>>>    closely, which is a strength of Calcite.
>>>>>    - One of the other problems with building against an implementation
>>>>>    as opposed to a specification (e.g. Calcite) is that it can make it
>>>>>    difficult or near impossible to implement the same algorithms again
>>>>>    without a bunch of reverse engineering. If you're interested in an
>>>>>    example of this, see the discussion behind the LZ4 deprecation in
>>>>>    the Parquet spec [8] for how painful this kind of mistake can
>>>>>    become.
>>>>>    - I'd love to use the SQL specification itself, but nobody actually
>>>>>    implements it in its entirety and it has far too many places where
>>>>>    things are "implementation-defined" [9].
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>> [2]
>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>> [4] https://github.com/linkedin/coral
>>>>> [5] https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>> [6] https://github.com/google/zetasql
>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>> [8]
>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>
>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> I did not notice that we are also discussing cross-engine
>>>>>> interoperability here, so I will add my response from the design doc
>>>>>> here.
>>>>>>
>>>>>> I would personally prefer cross-engine interoperability as a goal and
>>>>>> get the spec in the right structure in the initial release, because:
>>>>>>
>>>>>> 1. I believe that cross-engine compatibility is a critical feature of
>>>>>> Iceberg. If I am a user of an existing data lake that already supports
>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature.
>>>>>> I can do what is done for Trino today to use views with Iceberg. I can
>>>>>> also just use a table property to indicate that the table is a view,
>>>>>> store the view SQL as a table property, and do my own thing in any
>>>>>> query engine to support all the view features. One of the most valuable
>>>>>> and unique features that Iceberg views can unlock is allowing a view to
>>>>>> be created in one engine and read by another. Not supporting
>>>>>> cross-engine compatibility feels like losing a lot of value to me.
>>>>>>
>>>>>> 2. In the view definition, it feels inconsistent to me that we have a
>>>>>> "schema" field as an Iceberg-native schema, but a "sql" field as view
>>>>>> SQL that can come from any query engine. If the engine already needs to
>>>>>> convert the view schema to an Iceberg schema, it should just do the
>>>>>> same for the view SQL.
>>>>>>
>>>>>> Regarding the way to achieve it, I think it comes down to either Apache
>>>>>> Calcite (or some other third-party alternative I don't know of) or our
>>>>>> own implementation of some intermediate representation. I don't have a
>>>>>> very strong opinion, but my thoughts are the following:
>>>>>>
>>>>>> 1. Calcite is supposed to be the go-to software for dealing with this
>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>> definitely going to be much more involved, and it will become another
>>>>>> barrier for newer engines to onboard, because an engine not only needs
>>>>>> to implement the Iceberg APIs but also needs Calcite support. It would
>>>>>> also start a constant discussion around what we maintain and what we
>>>>>> should push to Calcite, similar to our situation today with Spark.
>>>>>>
>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to
>>>>>> have a native lightweight logical query structure representation of the
>>>>>> view SQL and store that instead of the SQL string. We already deal with
>>>>>> Expressions in Iceberg, and engines have to convert predicates to Iceberg
>>>>>> expressions for predicate pushdown. So I think it would not be hard to
>>>>>> extend that to support this use case. Different engines can build this
>>>>>> logical structure when traversing their own AST during a create view 
>>>>>> query.
>>>>>>
>>>>>> 3. With these considerations, I think the "sql" field can potentially
>>>>>> be a map (maybe called "engine-sqls"?), where the key is an engine type
>>>>>> and version like "Spark 3.1", and the value is the view SQL string. In
>>>>>> this way, the engine that creates the view can still read the SQL
>>>>>> directly, which might lead to better engine-native integration and
>>>>>> avoid redundant parsing. But in this approach there is always a default
>>>>>> intermediate representation to fall back to when the engine's key is
>>>>>> not found in the map. If we want to make incremental progress and delay
>>>>>> the design of the intermediate representation, I think we should at
>>>>>> least use this map instead of just a single string.
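>>>>>>
>>>>>> A hypothetical shape for that map (the keys and the default entry are
>>>>>> illustrative only):
>>>>>>
>>>>>>     "engine-sqls": {
>>>>>>       "default": "<intermediate representation, once defined>",
>>>>>>       "Spark 3.1": "SELECT id, region FROM prod.db.base_tab",
>>>>>>       "Trino 356": "SELECT id, region FROM prod.db.base_tab"
>>>>>>     }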
>>>>>>
>>>>>> Thanks,
>>>>>> Jack Ye
>>>>>>
>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>> pi...@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> First of all thank you for this discussion and all the view-related
>>>>>>> work!
>>>>>>>
>>>>>>> I agree that solving the cross-engine compatibility problem may not be
>>>>>>> a primary feature today, but I am concerned that not thinking about
>>>>>>> this from the start may "tunnel" us into a wrong direction.
>>>>>>> Cross-engine compatible views would be such a cool feature that it is
>>>>>>> hard to just let it pass.
>>>>>>>
>>>>>>> My thinking about a smaller IR may be a side-product of me not being
>>>>>>> familiar enough with Calcite.
>>>>>>> However, a new IR being focused on a compatible representation, and
>>>>>>> not being tied to anything else, are actually good things.
>>>>>>> For example, we would only need to deal with the JSON representation,
>>>>>>> not with tree traversal or anything else, so the code for this could
>>>>>>> be pretty simple.
>>>>>>>
>>>>>>> > Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>> across engines in the view definition (this is how Netflix currently
>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>
>>>>>>> It's interesting. Anjali, do you have the means to enforce that, or is
>>>>>>> this just a convention?
>>>>>>>
>>>>>>> What are the common building blocks (relational operations,
>>>>>>> constructs and functions) that you found sufficient for expressing your
>>>>>>> views?
>>>>>>> Being able to enumerate them could help validate various approaches
>>>>>>> considered here, including the feasibility of a dedicated
>>>>>>> representation.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Anjali,
>>>>>>>>
>>>>>>>> I am definitely happy to help with implementing 1-3 in your first
>>>>>>>> list once the spec has been approved by the community. My hope is
>>>>>>>> that the final version of the view spec will make it easy to reuse
>>>>>>>> existing rollback/time travel/metadata/etc. functionality.
>>>>>>>>
>>>>>>>> Regarding SQL dialects, my personal opinion is: enforcing
>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>> desirable, while storing Calcite ASTs makes it hard for e.g. Python
>>>>>>>> to use views. A project to make a cross-language and cross-engine IR
>>>>>>>> for SQL views and the relevant transpilers is IMHO outside the scope
>>>>>>>> of this spec and probably deserving of an Apache project of its own.
>>>>>>>> A smaller IR like Piotr suggested is possible, but I feel it would
>>>>>>>> likely quickly snowball into a larger project and slow down adoption
>>>>>>>> of the view spec in Iceberg. So I think the most reasonable way
>>>>>>>> forward is to add a dialect field and a warning to engines that views
>>>>>>>> are not (yet) cross-compatible. This is at odds with the original
>>>>>>>> spirit of Iceberg tables, and I wonder how the broader community
>>>>>>>> feels about it. I would hope that we can make the view spec
>>>>>>>> engine-free over time and eventually deprecate the dialect field.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> PS: If anyone is interested in collaborating on engine-agnostic
>>>>>>>> views, please reach out. I am keen on exploring this topic.
>>>>>>>>
>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have
>>>>>>>>> been addressing them in the doc and will continue to do so.
>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>
>>>>>>>>> Thanks to everybody who has offered help with implementation. The
>>>>>>>>> spec as it is proposed in the doc has been implemented and is in use
>>>>>>>>> at Netflix (currently on Iceberg 0.9). Once we close the spec, we
>>>>>>>>> will rebase our code to Iceberg 0.12, incorporate changes to the
>>>>>>>>> format and other feedback from the community, and should be able to
>>>>>>>>> make this MVP implementation available quickly as a PR.
>>>>>>>>>
>>>>>>>>> A few areas that we have not yet worked on and would love for the
>>>>>>>>> community to help are:
>>>>>>>>> 1. Time travel on views: be able to access the view as of a version
>>>>>>>>> or time (see the sketch after this list)
>>>>>>>>> 2. History table: a system table implementation for $versions,
>>>>>>>>> similar to the $snapshots table, in order to display the history of
>>>>>>>>> a view
>>>>>>>>> 3. Rollback to a version: a way to roll back a view to a previous
>>>>>>>>> version
>>>>>>>>> 4. Engine-agnostic SQL: more below.
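>>>>>>>>>
>>>>>>>>> For items 1-3, the version history could look roughly like this (a
>>>>>>>>> hypothetical sketch; the field names are not part of the spec):
>>>>>>>>>
>>>>>>>>>     "current-version-id": 2,
>>>>>>>>>     "versions": [
>>>>>>>>>       { "version-id": 1, "timestamp-ms": 1626843600000, "sql": "..." },
>>>>>>>>>       { "version-id": 2, "timestamp-ms": 1626930000000, "sql": "..." }
>>>>>>>>>     ]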
>>>>>>>>>
>>>>>>>>> One comment that is worth a broader discussion is the dialect of
>>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to
>>>>>>>>> provide a storage format for view metadata and APIs to access that
>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question
>>>>>>>>> and is outside the scope of this spec.
>>>>>>>>>
>>>>>>>>> Nonetheless, it is an important concern, so I am compiling a few
>>>>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly common
>>>>>>>>> across engines in the view definition (this is how Netflix currently
>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of the
>>>>>>>>> SQL. This allows for any desired dialect, but no improved
>>>>>>>>> cross-engine interoperability
>>>>>>>>> 3. Store the AST produced by Calcite in the view metadata and
>>>>>>>>> translate back and forth between engine-supported SQL and the AST
>>>>>>>>> 4. An intermediate structured language of our own (what additional
>>>>>>>>> functionality would it provide over Calcite?)
>>>>>>>>>
>>>>>>>>> Given that the view metadata is JSON, it is easily extensible to
>>>>>>>>> incorporate any new fields needed to make the SQL truly compatible
>>>>>>>>> across engines.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Anjali
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>> pi...@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>
>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>> between query engines.
>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>
>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>> I treat it as a great technology for supporting the transition from
>>>>>>>>>> one query engine to a better one.
>>>>>>>>>> However, I would not base a design of some new system for storing
>>>>>>>>>> cross-engine compatible views on that.
>>>>>>>>>>
>>>>>>>>>> Is there something else we can use?
>>>>>>>>>> Maybe the view definition should use some intermediate structured
>>>>>>>>>> language that's not SQL?
>>>>>>>>>> For example, it could represent the logical structure of
>>>>>>>>>> operations in a semantic manner.
>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing
>>>>>>>>>> and analysis.
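>>>>>>>>>>
>>>>>>>>>> As a minimal sketch of what such a structured definition might look
>>>>>>>>>> like (purely illustrative, assuming a JSON encoding and made-up node
>>>>>>>>>> names):
>>>>>>>>>>
>>>>>>>>>>     {
>>>>>>>>>>       "operation": "project",
>>>>>>>>>>       "columns": ["id", "region"],
>>>>>>>>>>       "input": {
>>>>>>>>>>         "operation": "filter",
>>>>>>>>>>         "predicate": { "op": "eq", "term": "region", "value": "US" },
>>>>>>>>>>         "input": { "operation": "scan", "table": "prod.db.base_tab" }
>>>>>>>>>>       }
>>>>>>>>>>     }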
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>
>>>>>>>>>>> I have left some comments on the document. I unfortunately have
>>>>>>>>>>> to miss the community meetup tomorrow but would love to chat 
>>>>>>>>>>> more/help w/
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been
>>>>>>>>>>>> implemented [1] and has been in production at Netflix for over 15
>>>>>>>>>>>> months.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>
>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>
>>>>>>>>>>>
>
> --
> Ryan Blue
> Tabular
>
-- 
John Zhuge
