Re: Proposal: Support for views in Iceberg

Ryan Blue Thu, 26 Aug 2021 13:42:21 -0700

Micah or someone more familiar (Jacques?), could you summarize some of the
details of the Arrow IR proposal? What are the intended use cases and
goals? We'd want to see whether the goals align with our own.


On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <[email protected]>
wrote:

> Small tangent:
>
> Regarding an IR, Apache Arrow is looking at this now [1] with several RFC
> proposals.  It would be nice to coordinate, at least to make sure
> translation from one IR to another isn't too onerous (in an ideal world we
> could maybe share the same one).  Either way external feedback would be
> useful.
>
> Cheers,
> Micah
>
>
> [1]
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=ejd0qpl9lygynq2jagxo+lfer...@mail.gmail.com%3e
>
> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <[email protected]> wrote:
>
>> I think that the current proposal is looking good, but it is always a
>> good idea to give people a few days to review it and bring up any issues or
>> further discussion on the topics in this thread.
>>
>> I'll also add this to the next sync agenda so we can farm for dissent
>> next week. Sometimes you can get a better read on what is still concerning
>> in person.
>>
>> Ryan
>>
>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
>> <[email protected]> wrote:
>>
>>> Hello All,
>>> I answered all the comments and made changes to the spec to reflect them
>>> .. the doc now shows the new revision (link here:
>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>>> ).
>>> Please diff with the version named 'Spec v1' in order to see the deltas.
>>> Please let me know if any of your comments are not
>>> satisfactorily addressed.
>>>
>>> @Ryan, Thank you for the insightful feedback, especially around the use
>>> of Iceberg data types for schema and possible evolution.
>>> I agree with your comment: *"I think we should move forward with the
>>> proposal as it is, and pursue type annotations and possibly IR in
>>> parallel."*
>>> How do we achieve consensus/farm for dissent and move forward with the
>>> proposal?
>>>
>>> thanks,
>>> Anjali.
>>>
>>>
>>>
>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>>> discussion here with everyone.
>>>>>
>>>>> For the discussion about SQL dialect, I think that the right first
>>>>> step is to capture the SQL or query dialect. That will give us the most
>>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>>> convert and engines that don’t can see if the SQL is valid and can be 
>>>>> used.
>>>>>
>>>>> I think that the idea to create a minimal IR is an interesting one,
>>>>> but can be added later. We will always need to record the SQL and dialect,
>>>>> even if we translate to IR because users are going to configure views 
>>>>> using
>>>>> SQL. Uses like showing view history or debugging need to show the original
>>>>> SQL, plus relevant information like where it was created and the SQL
>>>>> dialect. We should be able to add this later by adding additional metadata
>>>>> to the view definition. I don’t think that it would introduce breaking
>>>>> changes to add a common representation that can be optionally consumed.
>>>>>
>>>>> Let’s continue talking about a minimal IR, separately. View
>>>>> translation is a hard problem. Right now, to get views across engines we
>>>>> have to compromise confidence. IR is a way to have strong confidence, but
>>>>> with limited expressibility. I think that’s a good trade in a lot of cases
>>>>> and is worth pursuing, even if it will take a long time.
>>>>>
>>>>> Jacques makes a great point about types, but I think that the right
>>>>> option here is to continue using Iceberg types. We’ve already had
>>>>> discussions about whether Iceberg should support annotating types with
>>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>>> also providing compatibility across engines: char(n) is not
>>>>> necessarily supported everywhere and mapping it to string will make
>>>>> sense in most places. The schema is primarily used to validate that the
>>>>> data produced by the query hasn’t changed and that is more about the 
>>>>> number
>>>>> of columns in structs and the names of fields rather than exact types. We
>>>>> can fix up types when substituting without losing too much: if the SQL
>>>>> produces a varchar(10) field that the view metadata says is a string,
>>>>> then it’s okay that it is varchar(10). There is some loss in that we
>>>>> don’t know if it was originally varchar(5), but I think that this is
>>>>> not going to cause too many issues. Not all engines will even validate 
>>>>> that
>>>>> the schema has not changed, since it could be valid to use select *
>>>>> from x where ... and allow new fields to appear.
>>>>>
>>>>> Right now, I think we should move forward with the proposal as it is,
>>>>> and pursue type annotations and possibly IR in parallel. Does that sound
>>>>> reasonable to everyone?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Anjali,
>>>>>>
>>>>>> That's a nice summary.
>>>>>>
>>>>>> re dialect field. It shouldn't be a bit trouble to have it (or any
>>>>>> other way to identify application that created the view), and it might be
>>>>>> useful.
>>>>>> Why not make it required from the start?
>>>>>>
>>>>>> re "expanded/resolved SQL" -- i don't understand yet what we would
>>>>>> put there, so cannot comment.
>>>>>>
>>>>>> I agree there it's nice to get something out of the door, and I see
>>>>>> how the current proposal fits some needs already.
>>>>>> However, i am concerned about the proliferation of non-cross-engine
>>>>>> compatible views, if we do that.
>>>>>>
>>>>>> Also, if we later agree on any compatible approach (portable subset
>>>>>> of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>> engine, it would be a breaking change.
>>>>>> Unless we make the compatible approach as expressive as full power of
>>>>>> SQL, some views that are possible to create in v1 will not be possible to
>>>>>> create in v2.
>>>>>> Thus, if v1  is "some SQL" and v2 is "something awesomely
>>>>>> compatible", we may not be able to roll it out.
>>>>>>
>>>>>> > the convention of common SQL has been working for a majority of
>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>> application, joins, grouping and common aggregate and scalar function. A
>>>>>> few users occasionally would like to use Trino or Spark specific 
>>>>>> functions
>>>>>> but are sometimes able to find a way to use a function that is common to
>>>>>> both the engines.
>>>>>>
>>>>>>
>>>>>> it's an awesome summary of what constructs are necessary to be able
>>>>>> to define useful views, while also keep them portable.
>>>>>>
>>>>>> To be able to express column projections, simple filter application,
>>>>>> joins, grouping and common aggregate and scalar function in a structured
>>>>>> IR, how much effort do you think would be required?
>>>>>> We didn't really talk about downsides of a structured approach, other
>>>>>> than it looks complex.
>>>>>> if we indeed estimate it as a multi-year effort, i wouldn't argue for
>>>>>> that. Maybe i were overly optimistic though.
>>>>>>
>>>>>>
>>>>>> As Jack mentioned, for engine-specific approach that's not supposed
>>>>>> to be consumed by multiple engines, we may be better served with approach
>>>>>> that's outside of Iceberg spec, like
>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> PF
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thank you for all the comments. I will try to address them all here
>>>>>>> together.
>>>>>>>
>>>>>>>
>>>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>>>    options such as engine agnostic SQL or IR of some form have been 
>>>>>>> mentioned.
>>>>>>>    We can all agree that all of these options are non-trivial to
>>>>>>>    design/implement (perhaps a multi-year effort based on the option 
>>>>>>> chosen)
>>>>>>>    and merit further discussion. I would like to suggest that we 
>>>>>>> continue this
>>>>>>>    discussion but target this work for the future (v2?). In v1, we can 
>>>>>>> add an
>>>>>>>    optional dialect field and an optional expanded/resolved SQL field 
>>>>>>> that can
>>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use 
>>>>>>> cases
>>>>>>>    where the views are either accessed by a single engine or 
>>>>>>> multi-engine use
>>>>>>>    cases where a (common) subset of SQL is supported. This proposal 
>>>>>>> allows for
>>>>>>>    desirable features such as versioning of views and a common format of
>>>>>>>    storing view metadata while allowing extensibility in the future. 
>>>>>>> *Does
>>>>>>>    anyone feel strongly otherwise?*
>>>>>>>    - @Piotr  As for common views at Netflix, the restrictions on
>>>>>>>    SQL are not enforced, but are advised as best practices. The 
>>>>>>> convention of
>>>>>>>    common SQL has been working for a majority of users. SQL features 
>>>>>>> commonly
>>>>>>>    used are column projections, simple filter application, joins, 
>>>>>>> grouping and
>>>>>>>    common aggregate and scalar function. A few users occasionally would 
>>>>>>> like
>>>>>>>    to use Trino or Spark specific functions but are sometimes able to 
>>>>>>> find a
>>>>>>>    way to use a function that is common to both the engines.
>>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic and
>>>>>>>    hence were picked for storing view schema. Thinking further, the 
>>>>>>> schema
>>>>>>>    field should be made 'optional', since not all engines require it. 
>>>>>>> (e.g.
>>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>>    - @Jacques Table references in the views can be arbitrary
>>>>>>>    objects such as tables from other catalogs or elasticsearch tables 
>>>>>>> etc. I
>>>>>>>    will clarify it in the spec.
>>>>>>>
>>>>>>> I will work on incorporating all the comments in the spec and make
>>>>>>> the next revision available for review soon.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Anjali.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>>>>
>>>>>>>> I agree that tracking  dialect/origin is better than nothing.
>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>>>>> I.e. it would be useful if there was some external app (or a human
>>>>>>>> being) that would write those alternative SQLs for each dialect.
>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino,
>>>>>>>> or Trino writing SQL for Trino and Spark.
>>>>>>>>
>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>> While i like the idea of starting with Trino SQL ANTLR grammar file
>>>>>>>> (it's really well written and resembles spec quite well), you made a 
>>>>>>>> good
>>>>>>>> point that grammar is only part of the problem. Coercions, function
>>>>>>>> resolution, dereference resolution, table resolution are part of query
>>>>>>>> analysis that goes beyond just grammar.
>>>>>>>> In fact, column scoping rules -- while clearly defined by the spec
>>>>>>>> -- may easily differ between engines (pretty usual).
>>>>>>>> That's why i would rather lean towards some intermediate
>>>>>>>> representation that is *not *SQL, doesn't require parsing (is
>>>>>>>> already structural), nor analysis (no scopes! no implicit coercions!).
>>>>>>>> Before we embark on such a journey, it would be interesting to hear 
>>>>>>>> @Martin
>>>>>>>> Traverso <[email protected]> 's thoughts on feasibility
>>>>>>>> though.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Some thoughts...
>>>>>>>>>
>>>>>>>>>    - In general, many engines want (or may require) a resolved
>>>>>>>>>    sql field. This--at minimum--typically includes star expansion 
>>>>>>>>> since
>>>>>>>>>    traditional view behavior is stars are expanded at view creation 
>>>>>>>>> time
>>>>>>>>>    (since this is the only way to guarantee that the view returns the 
>>>>>>>>> same
>>>>>>>>>    logical definition even if the underlying table changes). This may 
>>>>>>>>> also
>>>>>>>>>    include a replacement of relative object names to absolute object 
>>>>>>>>> names
>>>>>>>>>    based on the session catalog & namespace. If I recall correctly, 
>>>>>>>>> Hive does
>>>>>>>>>    both of these things.
>>>>>>>>>    - It isn't clear in the spec whether the table references used
>>>>>>>>>    in views are restricted to other Iceberg objects or can be 
>>>>>>>>> arbitrary
>>>>>>>>>    objects in the context of a particular engine. Maybe I missed 
>>>>>>>>> this? For
>>>>>>>>>    example, can I have a Trino engine view that references an 
>>>>>>>>> Elasticsearch
>>>>>>>>>    table stored in an Iceberg view?
>>>>>>>>>    - Restricting schemas to the Iceberg types will likely lead to
>>>>>>>>>    unintended consequences. I appreciate the attraction to it but I 
>>>>>>>>> think it
>>>>>>>>>    would either create artificial barriers around the types of SQL 
>>>>>>>>> that are
>>>>>>>>>    allowed and/or mean that replacing a CTE with a view could 
>>>>>>>>> potentially
>>>>>>>>>    change the behavior of the query which I believe violates most 
>>>>>>>>> typical
>>>>>>>>>    engine behaviors. A good example of this is the simple sql 
>>>>>>>>> statement of
>>>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite 
>>>>>>>>> by
>>>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this 
>>>>>>>>> Iceberg
>>>>>>>>>    context, is this view disallowed? If it isn't disallowed then you 
>>>>>>>>> have an
>>>>>>>>>    issue where the view schema will be required to be different from 
>>>>>>>>> a CTE
>>>>>>>>>    since the engine will resolve it differently than Iceberg. Even if 
>>>>>>>>> you
>>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>>    libraries and not a specification. There are things that can be 
>>>>>>>>> specified
>>>>>>>>>    in Calcite but in general it doesn't have formal specification as 
>>>>>>>>> a first
>>>>>>>>>    principle. It is more implementation as a first principle. This is 
>>>>>>>>> in
>>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with 
>>>>>>>>> well-formed
>>>>>>>>>    specifications. I've been working with Calcite since before it was 
>>>>>>>>> an
>>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form of 
>>>>>>>>> a
>>>>>>>>>    specification. On the flipside, I am very supportive of using it 
>>>>>>>>> for a
>>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the 
>>>>>>>>> adoption of
>>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that Spark 
>>>>>>>>> [1]
>>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification as 
>>>>>>>>> grammar
>>>>>>>>>    must still be interpreted with regards to type promotion, function
>>>>>>>>>    resolution, consistent unnamed expression naming, etc that aren't 
>>>>>>>>> defined
>>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's JavaCC 
>>>>>>>>> grammar
>>>>>>>>>    as it heavily embeds implementation details (in a good way) and 
>>>>>>>>> relies on
>>>>>>>>>    some fairly complex logic in the validator and sql2rel components 
>>>>>>>>> to be
>>>>>>>>>    fully resolved/comprehended.
>>>>>>>>>
>>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>>> dialect(origin?) of the view and then each engine can decide how they 
>>>>>>>>> want
>>>>>>>>> to consume/mutate that view (and whether they want to or not). It 
>>>>>>>>> does risk
>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>>> establishing a formal SQL specification to be a similarly long 
>>>>>>>>> process to
>>>>>>>>> the couple of years it took to build the Arrow and Iceberg 
>>>>>>>>> specifications.
>>>>>>>>> (Realistically, there is far more to specify here than there is in 
>>>>>>>>> either
>>>>>>>>> of those two domains.)
>>>>>>>>>
>>>>>>>>> Some other notes:
>>>>>>>>>
>>>>>>>>>    - Calcite does provide a nice reference document [3] but it is
>>>>>>>>>    not sufficient to implement what is necessary for
>>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>    - Projects like Coral [4] are interesting here but even Coral
>>>>>>>>>    is based roughly on "HiveQL" which also doesn't have a formal 
>>>>>>>>> specification
>>>>>>>>>    process outside of the Hive version you're running. See this 
>>>>>>>>> thread in
>>>>>>>>>    Coral slack [5]
>>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>>>    closer to specification based [7] than Calcite but is much less 
>>>>>>>>> popular in
>>>>>>>>>    the big data domain. I also haven't reviewed it's SQL completeness 
>>>>>>>>> closely,
>>>>>>>>>    a strength of Calcite.
>>>>>>>>>    - One of the other problems with building against an
>>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is it 
>>>>>>>>> can make
>>>>>>>>>    it difficult or near impossible to implement the same algorithms 
>>>>>>>>> again
>>>>>>>>>    without a bunch of reverse engineering. If interested in an 
>>>>>>>>> example of
>>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet 
>>>>>>>>> spec [8] for
>>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>>    actually implements that in its entirety and it has far too many 
>>>>>>>>> places
>>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>> [2]
>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>>> [5]
>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>>> [7]
>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>>> [8]
>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>>>>
>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>>> interoperability here, I will add my response in the design doc here.
>>>>>>>>>>
>>>>>>>>>> I would personally prefer cross-engine interoperability as a goal
>>>>>>>>>> and get the spec in the right structure in the initial release, 
>>>>>>>>>> because:
>>>>>>>>>>
>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical
>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that 
>>>>>>>>>> already
>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this 
>>>>>>>>>> view
>>>>>>>>>> feature. I can do what is now done for Trino to use views with 
>>>>>>>>>> Iceberg. I
>>>>>>>>>> can also just use a table property to indicate the table is a view 
>>>>>>>>>> and
>>>>>>>>>> store the view SQL as a table property and do my own thing in any 
>>>>>>>>>> query
>>>>>>>>>> engine to support all the view features. One of the most valuable and
>>>>>>>>>> unique features that Iceberg view can unlock is to allow a view to be
>>>>>>>>>> created in one engine and read by another. Not supporting 
>>>>>>>>>> cross-engine
>>>>>>>>>> compatibility feels like losing a lot of value to me.
>>>>>>>>>>
>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the 
>>>>>>>>>> view SQL
>>>>>>>>>> that can come from any query engine. If the engine already needs to 
>>>>>>>>>> convert
>>>>>>>>>> the view schema to iceberg shema, it should just do the same for the 
>>>>>>>>>> view
>>>>>>>>>> SQL.
>>>>>>>>>>
>>>>>>>>>> Regarding the way to achieve it, I think it comes to either
>>>>>>>>>> Apache Calcite (or some other third party alternative I don't know) 
>>>>>>>>>> or our
>>>>>>>>>> own implementation of some intermediate representation. I don't have 
>>>>>>>>>> a very
>>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>>
>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this
>>>>>>>>>> kind of issue, but my personal concern is that the integration is
>>>>>>>>>> definitely going to be much more involved, and it will become another
>>>>>>>>>> barrier for newer engines to onboard because it not only needs to 
>>>>>>>>>> implement
>>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to 
>>>>>>>>>> become a
>>>>>>>>>> constant discussion around what we maintain and what we should push 
>>>>>>>>>> to
>>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>>
>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is
>>>>>>>>>> to have a native lightweight logical query structure representation 
>>>>>>>>>> of the
>>>>>>>>>> view SQL and store that instead of the SQL string. We already deal 
>>>>>>>>>> with
>>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to 
>>>>>>>>>> Iceberg
>>>>>>>>>> expressions for predicate pushdown. So I think it would not be hard 
>>>>>>>>>> to
>>>>>>>>>> extend on that to support this use case. Different engines can build 
>>>>>>>>>> this
>>>>>>>>>> logical structure when traversing their own AST during a create view 
>>>>>>>>>> query.
>>>>>>>>>>
>>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is the 
>>>>>>>>>> engine
>>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL string. 
>>>>>>>>>> In
>>>>>>>>>> this way, the engine that creates the view can still read the SQL 
>>>>>>>>>> directly
>>>>>>>>>> which might lead to better engine-native integration and avoid 
>>>>>>>>>> redundant
>>>>>>>>>> parsing. But in this approach there is always a default intermediate
>>>>>>>>>> representation it can fallback to when the engine's key is not found 
>>>>>>>>>> in the
>>>>>>>>>> map. If we want to make incremental progress and delay the design 
>>>>>>>>>> for the
>>>>>>>>>> intermediate representation, I think we should at least use this map
>>>>>>>>>> instead of just a single string.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>>> view-related work!
>>>>>>>>>>>
>>>>>>>>>>> I agree that solving cross-engine compatibility problem may not
>>>>>>>>>>> be primary feature today, I am concerned that not thinking about 
>>>>>>>>>>> this from
>>>>>>>>>>> the start may "tunnel" us into a wrong direction.
>>>>>>>>>>> Cross-engine compatible views would be such a cool feature that
>>>>>>>>>>> it is hard to just let it pass.
>>>>>>>>>>>
>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>>>> However, with new IR being focused on compatible representation,
>>>>>>>>>>> and not being tied to anything are actually good things.
>>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>>> don't need to deal with tree traversal or anything, so the code for 
>>>>>>>>>>> this
>>>>>>>>>>> could be pretty simple.
>>>>>>>>>>>
>>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>> common across engines in the view definition (this is how currently 
>>>>>>>>>>> Netflix
>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>
>>>>>>>>>>> it's interesting. Anjali, do  you have means to enforce that, or
>>>>>>>>>>> is this just a convention?
>>>>>>>>>>>
>>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>>> constructs and functions) that you found sufficient for expressing 
>>>>>>>>>>> your
>>>>>>>>>>> views?
>>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>>> representation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> PF
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>
>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>>> first list once the spec has been approved by the community. My 
>>>>>>>>>>>> hope is
>>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>>> re-use existing rollback/time travel/metadata etc functionalities.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding SQL dialects.My personal opinion is: Enforcing
>>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>>> desirable while storing Calcite makes it hard for eg python to use 
>>>>>>>>>>>> views. A
>>>>>>>>>>>> project to make a cross language and cross engine IR for sql views 
>>>>>>>>>>>> and the
>>>>>>>>>>>> relevant transpilers is imho outside the scope of this spec and 
>>>>>>>>>>>> probably
>>>>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr 
>>>>>>>>>>>> suggested
>>>>>>>>>>>> is possible but I feel it will likely quickly snowball into a 
>>>>>>>>>>>> larger
>>>>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I 
>>>>>>>>>>>> think the
>>>>>>>>>>>> most reasonable way forward is to add a dialect field and a 
>>>>>>>>>>>> warning to
>>>>>>>>>>>> engines that views are not (yet) cross compatible. This is at odds 
>>>>>>>>>>>> with the
>>>>>>>>>>>> original spirit of iceberg tables and I wonder how the border 
>>>>>>>>>>>> community
>>>>>>>>>>>> feels about it? I would hope that we can make the view spec 
>>>>>>>>>>>> engine-free
>>>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> PS if anyone is interested in collaborating on engine agnostic
>>>>>>>>>>>> views please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and is 
>>>>>>>>>>>>> in use at
>>>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we 
>>>>>>>>>>>>> will rebase
>>>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>>>>>>>>> other feedback from the community and should be able to make this 
>>>>>>>>>>>>> MVP
>>>>>>>>>>>>> implementation available quickly as a PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A few areas that we have not yet worked on and would love for
>>>>>>>>>>>>> the community to help are:
>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>>> version or time
>>>>>>>>>>>>> 2. History table: A system table implementation for $versions
>>>>>>>>>>>>> similar to the $snapshots table in order to display the history 
>>>>>>>>>>>>> of a view
>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>>> previous version
>>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One comment that is worth a broader discussion is the dialect
>>>>>>>>>>>>> of the SQL stored in the view metadata. The purpose of the spec 
>>>>>>>>>>>>> is to
>>>>>>>>>>>>> provide a storage format for view metadata and APIs to access that
>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal question 
>>>>>>>>>>>>> and is
>>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>>> suggestions that came up in the comments to continue the 
>>>>>>>>>>>>> discussion:
>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>> common across engines in the view definition (this is how 
>>>>>>>>>>>>> currently Netflix
>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of
>>>>>>>>>>>>> the SQL. This allows for any desired dialect, but no improved 
>>>>>>>>>>>>> cross-engine
>>>>>>>>>>>>> operability
>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given that the view metadata is json, it is easily extendable
>>>>>>>>>>>>> to incorporate any new fields needed to make the SQL truly 
>>>>>>>>>>>>> compatible
>>>>>>>>>>>>> across engines.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>> Anjali
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anjali, your document does not talk much about compatibility
>>>>>>>>>>>>>> between query engines.
>>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive.
>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning
>>>>>>>>>>>>>> from a query engine to a better one.
>>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>>> For example, it could represent logical structure of
>>>>>>>>>>>>>> operations in semantics manner.
>>>>>>>>>>>>>> This would eliminate need for cross-engine compatible parsing
>>>>>>>>>>>>>> and analysis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to 
>>>>>>>>>>>>>>> chat more/help
>>>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec
>>>>>>>>>>>>>>>> for storing view metadata in Iceberg. The proposal has been 
>>>>>>>>>>>>>>>> implemented [1]
>>>>>>>>>>>>>>>> and is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the
>>>>>>>>>>>>>>>> doc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular

Re: Proposal: Support for views in Iceberg

Reply via email to