I'll post a link to this thread on dev@arrow so people not subscribed to both can weigh in. Should we maybe start a new thread about IR?
https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit is the document that Wes wrote up that started the conversation. On Thu, Aug 26, 2021 at 1:42 PM Ryan Blue <b...@tabular.io> wrote: > Micah or someone more familiar (Jacques?), could you summarize some of the > details of the Arrow IR proposal? What are the intended use cases and > goals? We'd want to see whether the goals align with our own. > > On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Small tangent: >> >> Regarding an IR, Apache Arrow is looking at this now [1] with several RFC >> proposals. It would be nice to coordinate, at least to make sure >> translation from one IR to another isn't too onerous (in an ideal world we >> could maybe share the same one). Either way external feedback would be >> useful. >> >> Cheers, >> Micah >> >> >> [1] >> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=ejd0qpl9lygynq2jagxo+lfer...@mail.gmail.com%3e >> >> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <b...@tabular.io> wrote: >> >>> I think that the current proposal is looking good, but it is always a >>> good idea to give people a few days to review it and bring up any issues or >>> further discussion on the topics in this thread. >>> >>> I'll also add this to the next sync agenda so we can farm for dissent >>> next week. Sometimes you can get a better read on what is still concerning >>> in person. >>> >>> Ryan >>> >>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood >>> <anorw...@netflix.com.invalid> wrote: >>> >>>> Hello All, >>>> I answered all the comments and made changes to the spec to reflect >>>> them .. the doc now shows the new revision (link here: >>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit# >>>> ). >>>> Please diff with the version named 'Spec v1' in order to see the >>>> deltas. 
Please let me know if any of your comments are not >>>> satisfactorily addressed. >>>> >>>> @Ryan, Thank you for the insightful feedback, especially around the use >>>> of Iceberg data types for schema and possible evolution. >>>> I agree with your comment: *"I think we should move forward with the >>>> proposal as it is, and pursue type annotations and possibly IR in >>>> parallel."* >>>> How do we achieve consensus/farm for dissent and move forward with the >>>> proposal? >>>> >>>> thanks, >>>> Anjali. >>>> >>>> >>>> >>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jzh...@apache.org> wrote: >>>> >>>>> +1 >>>>> >>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <b...@tabular.io> wrote: >>>>> >>>>>> Thanks for working on this, Anjali! It’s great to see the thorough >>>>>> discussion here with everyone. >>>>>> >>>>>> For the discussion about SQL dialect, I think that the right first >>>>>> step is to capture the SQL or query dialect. That will give us the most >>>>>> flexibility. Engines that can use Coral for translation can attempt to >>>>>> convert and engines that don’t can see if the SQL is valid and can be >>>>>> used. >>>>>> >>>>>> I think that the idea to create a minimal IR is an interesting one, >>>>>> but can be added later. We will always need to record the SQL and >>>>>> dialect, >>>>>> even if we translate to IR because users are going to configure views >>>>>> using >>>>>> SQL. Uses like showing view history or debugging need to show the >>>>>> original >>>>>> SQL, plus relevant information like where it was created and the SQL >>>>>> dialect. We should be able to add this later by adding additional >>>>>> metadata >>>>>> to the view definition. I don’t think that it would introduce breaking >>>>>> changes to add a common representation that can be optionally consumed. >>>>>> >>>>>> Let’s continue talking about a minimal IR, separately. View >>>>>> translation is a hard problem. 
Right now, to get views across engines we >>>>>> have to compromise confidence. IR is a way to have strong confidence, but >>>>>> with limited expressibility. I think that’s a good trade in a lot of >>>>>> cases >>>>>> and is worth pursuing, even if it will take a long time. >>>>>> >>>>>> Jacques makes a great point about types, but I think that the right >>>>>> option here is to continue using Iceberg types. We’ve already had >>>>>> discussions about whether Iceberg should support annotating types with >>>>>> engine-specific ones, so we have a reasonable way to improve this while >>>>>> also providing compatibility across engines: char(n) is not >>>>>> necessarily supported everywhere and mapping it to string will make >>>>>> sense in most places. The schema is primarily used to validate that the >>>>>> data produced by the query hasn’t changed and that is more about the >>>>>> number >>>>>> of columns in structs and the names of fields rather than exact types. We >>>>>> can fix up types when substituting without losing too much: if the SQL >>>>>> produces a varchar(10) field that the view metadata says is a string, >>>>>> then it’s okay that it is varchar(10). There is some loss in that we >>>>>> don’t know if it was originally varchar(5), but I think that this is >>>>>> not going to cause too many issues. Not all engines will even validate >>>>>> that >>>>>> the schema has not changed, since it could be valid to use select * >>>>>> from x where ... and allow new fields to appear. >>>>>> >>>>>> Right now, I think we should move forward with the proposal as it is, >>>>>> and pursue type annotations and possibly IR in parallel. Does that sound >>>>>> reasonable to everyone? >>>>>> >>>>>> Ryan >>>>>> >>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen < >>>>>> pi...@starburstdata.com> wrote: >>>>>> >>>>>>> Hi Anjali, >>>>>>> >>>>>>> That's a nice summary. >>>>>>> >>>>>>> re dialect field. 
It shouldn't be much trouble to have it (or any other way to identify the application that created the view), and it might be useful. Why not make it required from the start?
>>>>>>>
>>>>>>> re "expanded/resolved SQL" -- I don't understand yet what we would put there, so I cannot comment.
>>>>>>>
>>>>>>> I agree it's nice to get something out the door, and I see how the current proposal fits some needs already.
>>>>>>> However, I am concerned about the proliferation of non-cross-engine-compatible views if we do that.
>>>>>>>
>>>>>>> Also, if we later agree on any compatible approach (portable subset of SQL, engine-agnostic IR, etc.), then from the perspective of each engine, it would be a breaking change.
>>>>>>> Unless we make the compatible approach as expressive as the full power of SQL, some views that are possible to create in v1 will not be possible to create in v2.
>>>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", we may not be able to roll it out.
>>>>>>>
>>>>>>> > the convention of common SQL has been working for a majority of users. SQL features commonly used are column projections, simple filter application, joins, grouping and common aggregate and scalar function. A few users occasionally would like to use Trino or Spark specific functions but are sometimes able to find a way to use a function that is common to both the engines.
>>>>>>>
>>>>>>> It's an awesome summary of what constructs are necessary to be able to define useful views while also keeping them portable.
>>>>>>>
>>>>>>> To be able to express column projections, simple filter application, joins, grouping and common aggregate and scalar functions in a structured IR, how much effort do you think would be required?
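[Editor's sketch: the constructs enumerated above -- projection, filtering, grouping and aggregation -- can be made concrete as a tiny JSON-serializable tree. Every node shape and field name below is invented purely for illustration; none of this is part of any Iceberg proposal or spec.]

```python
import json

# Hypothetical, minimal JSON-serializable IR nodes for the constructs
# listed in the thread (projection, filter, grouping, aggregation).
# All names here are invented for illustration only.
def scan(table):
    return {"op": "scan", "table": table}

def filter_(child, predicate):
    return {"op": "filter", "child": child, "predicate": predicate}

def aggregate(child, group_by, aggs):
    return {"op": "aggregate", "child": child,
            "group-by": group_by, "aggs": aggs}

def project(child, columns):
    return {"op": "project", "child": child, "columns": columns}

# SELECT region, count(*) AS cnt FROM db.orders WHERE status = 'open' GROUP BY region
view_ir = project(
    aggregate(
        filter_(scan("db.orders"),
                {"fn": "eq", "args": [{"col": "status"}, {"lit": "open"}]}),
        group_by=["region"],
        aggs={"cnt": {"fn": "count", "args": []}},
    ),
    columns=["region", "cnt"],
)

# The tree round-trips through JSON, so no engine-specific parsing
# or analysis (scopes, coercions) is needed to consume it.
assert json.loads(json.dumps(view_ir)) == view_ir
print(view_ir["op"])  # project
```

Even this toy version suggests the real cost is not the tree encoding itself but agreeing on function names, type coercions and null semantics across engines.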
>>>>>>> We didn't really talk about the downsides of a structured approach, other than that it looks complex.
>>>>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for it. Maybe I was overly optimistic though.
>>>>>>>
>>>>>>> As Jack mentioned, for an engine-specific approach that's not supposed to be consumed by multiple engines, we may be better served by an approach that's outside of the Iceberg spec, like https://github.com/trinodb/trino/pull/8540.
>>>>>>>
>>>>>>> Best,
>>>>>>> PF
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thank you for all the comments. I will try to address them all here together.
>>>>>>>>
>>>>>>>> - @all Cross-engine compatibility of view definition: Multiple options such as engine-agnostic SQL or an IR of some form have been mentioned. We can all agree that all of these options are non-trivial to design/implement (perhaps a multi-year effort based on the option chosen) and merit further discussion. I would like to suggest that we continue this discussion but target this work for the future (v2?). In v1, we can add an optional dialect field and an optional expanded/resolved SQL field that can be interpreted by engines as they see fit. V1 can unlock many use cases where the views are either accessed by a single engine, or multi-engine use cases where a (common) subset of SQL is supported. This proposal allows for desirable features such as versioning of views and a common format of storing view metadata while allowing extensibility in the future.
>>>>>>>> *Does >>>>>>>> anyone feel strongly otherwise?* >>>>>>>> - @Piotr As for common views at Netflix, the restrictions on >>>>>>>> SQL are not enforced, but are advised as best practices. The >>>>>>>> convention of >>>>>>>> common SQL has been working for a majority of users. SQL features >>>>>>>> commonly >>>>>>>> used are column projections, simple filter application, joins, >>>>>>>> grouping and >>>>>>>> common aggregate and scalar function. A few users occasionally >>>>>>>> would like >>>>>>>> to use Trino or Spark specific functions but are sometimes able to >>>>>>>> find a >>>>>>>> way to use a function that is common to both the engines. >>>>>>>> - @Jacques and @Jack Iceberg data types are engine agnostic and >>>>>>>> hence were picked for storing view schema. Thinking further, the >>>>>>>> schema >>>>>>>> field should be made 'optional', since not all engines require it. >>>>>>>> (e.g. >>>>>>>> Spark does not need it and Trino uses it only for validation). >>>>>>>> - @Jacques Table references in the views can be arbitrary >>>>>>>> objects such as tables from other catalogs or elasticsearch tables >>>>>>>> etc. I >>>>>>>> will clarify it in the spec. >>>>>>>> >>>>>>>> I will work on incorporating all the comments in the spec and make >>>>>>>> the next revision available for review soon. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Anjali. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen < >>>>>>>> pi...@starburstdata.com> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Thanks Jack and Jacques for sharing your thoughts. >>>>>>>>> >>>>>>>>> I agree that tracking dialect/origin is better than nothing. >>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much. >>>>>>>>> I.e. it would be useful if there was some external app (or a human >>>>>>>>> being) that would write those alternative SQLs for each dialect. 
>>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or Trino writing SQL for Trino and Spark.
>>>>>>>>>
>>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>>> While I like the idea of starting with the Trino SQL ANTLR grammar file (it's really well written and resembles the spec quite well), you made a good point that grammar is only part of the problem. Coercions, function resolution, dereference resolution and table resolution are parts of query analysis that go beyond just grammar.
>>>>>>>>> In fact, column scoping rules -- while clearly defined by the spec -- may easily differ between engines (pretty usual).
>>>>>>>>> That's why I would rather lean towards some intermediate representation that is *not* SQL, doesn't require parsing (is already structural), nor analysis (no scopes! no implicit coercions!).
>>>>>>>>> Before we embark on such a journey, it would be interesting to hear @Martin Traverso <mar...@starburstdata.com> 's thoughts on feasibility though.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Some thoughts...
>>>>>>>>>>
>>>>>>>>>> - In general, many engines want (or may require) a resolved sql field. This--at minimum--typically includes star expansion, since traditional view behavior is that stars are expanded at view creation time (since this is the only way to guarantee that the view returns the same logical definition even if the underlying table changes). This may also include replacement of relative object names with absolute object names based on the session catalog & namespace.
If I recall correctly, >>>>>>>>>> Hive does >>>>>>>>>> both of these things. >>>>>>>>>> - It isn't clear in the spec whether the table references >>>>>>>>>> used in views are restricted to other Iceberg objects or can be >>>>>>>>>> arbitrary >>>>>>>>>> objects in the context of a particular engine. Maybe I missed >>>>>>>>>> this? For >>>>>>>>>> example, can I have a Trino engine view that references an >>>>>>>>>> Elasticsearch >>>>>>>>>> table stored in an Iceberg view? >>>>>>>>>> - Restricting schemas to the Iceberg types will likely lead >>>>>>>>>> to unintended consequences. I appreciate the attraction to it but >>>>>>>>>> I think >>>>>>>>>> it would either create artificial barriers around the types of >>>>>>>>>> SQL that are >>>>>>>>>> allowed and/or mean that replacing a CTE with a view could >>>>>>>>>> potentially >>>>>>>>>> change the behavior of the query which I believe violates most >>>>>>>>>> typical >>>>>>>>>> engine behaviors. A good example of this is the simple sql >>>>>>>>>> statement of >>>>>>>>>> "SELECT c1, 'foo' as c2 from table1". In many engines (and >>>>>>>>>> Calcite by >>>>>>>>>> default I believe), c2 will be specified as a CHAR(3). In this >>>>>>>>>> Iceberg >>>>>>>>>> context, is this view disallowed? If it isn't disallowed then you >>>>>>>>>> have an >>>>>>>>>> issue where the view schema will be required to be different from >>>>>>>>>> a CTE >>>>>>>>>> since the engine will resolve it differently than Iceberg. Even >>>>>>>>>> if you >>>>>>>>>> ignore CHAR(X), you've still got VARCHAR(X) to contend with... >>>>>>>>>> - It is important to remember that Calcite is a set of >>>>>>>>>> libraries and not a specification. There are things that can be >>>>>>>>>> specified >>>>>>>>>> in Calcite but in general it doesn't have formal specification as >>>>>>>>>> a first >>>>>>>>>> principle. It is more implementation as a first principle. 
This >>>>>>>>>> is in >>>>>>>>>> contrast to projects like Arrow and Iceberg, which start with >>>>>>>>>> well-formed >>>>>>>>>> specifications. I've been working with Calcite since before it >>>>>>>>>> was an >>>>>>>>>> Apache project and I wouldn't recommend adopting it as any form >>>>>>>>>> of a >>>>>>>>>> specification. On the flipside, I am very supportive of using it >>>>>>>>>> for a >>>>>>>>>> reference implementation standard for Iceberg view consumption, >>>>>>>>>> manipulation, etc. If anything, I'd suggest we start with the >>>>>>>>>> adoption of >>>>>>>>>> a relatively clear grammar, e.g. the Antlr grammar file that >>>>>>>>>> Spark [1] >>>>>>>>>> and/or Trino [2] use. Even that is not a complete specification >>>>>>>>>> as grammar >>>>>>>>>> must still be interpreted with regards to type promotion, function >>>>>>>>>> resolution, consistent unnamed expression naming, etc that aren't >>>>>>>>>> defined >>>>>>>>>> at the grammar level. I'd definitely avoid using Calcite's JavaCC >>>>>>>>>> grammar >>>>>>>>>> as it heavily embeds implementation details (in a good way) and >>>>>>>>>> relies on >>>>>>>>>> some fairly complex logic in the validator and sql2rel components >>>>>>>>>> to be >>>>>>>>>> fully resolved/comprehended. >>>>>>>>>> >>>>>>>>>> Given the above, I suggest having a field which describes the >>>>>>>>>> dialect(origin?) of the view and then each engine can decide how >>>>>>>>>> they want >>>>>>>>>> to consume/mutate that view (and whether they want to or not). It >>>>>>>>>> does risk >>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of >>>>>>>>>> establishing a formal SQL specification to be a similarly long >>>>>>>>>> process to >>>>>>>>>> the couple of years it took to build the Arrow and Iceberg >>>>>>>>>> specifications. >>>>>>>>>> (Realistically, there is far more to specify here than there is in >>>>>>>>>> either >>>>>>>>>> of those two domains.) 
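[Editor's sketch: to make the dialect (origin) field suggested above concrete, here is one way a view version record carrying such a field might look. Every field name and value in this sketch is an assumption made for illustration; the authoritative layout is in Anjali's spec document.]

```python
import json

# Illustrative only: a view version record whose SQL representation is
# tagged with the engine dialect that produced it. All field names here
# ("version-id", "representations", "dialect", ...) are invented for
# this sketch, not taken from the spec document.
view_version = {
    "version-id": 1,
    "timestamp-ms": 1626818400000,
    "summary": {"operation": "create"},
    "representations": [
        {
            "type": "sql",
            "sql": "SELECT c1, 'foo' AS c2 FROM db.table1",
            "dialect": "spark-3.1",  # origin engine, per the proposal
        }
    ],
}

print(json.dumps(view_version, indent=2))
```

A consuming engine could check the dialect field and decide whether to run the SQL as-is, translate it, or refuse the view.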
>>>>>>>>>>
>>>>>>>>>> Some other notes:
>>>>>>>>>>
>>>>>>>>>> - Calcite does provide a nice reference document [3] but it is not sufficient to implement what is necessary for parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>> - Projects like Coral [4] are interesting here, but even Coral is based roughly on "HiveQL", which also doesn't have a formal specification process outside of the Hive version you're running. See this thread in the Coral slack [5].
>>>>>>>>>> - ZetaSQL [6] also seems interesting in this space. It feels closer to specification-based [7] than Calcite but is much less popular in the big data domain. I also haven't reviewed its SQL completeness closely, a strength of Calcite.
>>>>>>>>>> - One of the other problems with building against an implementation as opposed to a specification (e.g. Calcite) is that it can make it difficult or near impossible to implement the same algorithms again without a bunch of reverse engineering. If interested in an example of this, see the discussion behind the LZ4 deprecation on the Parquet spec [8] for how painful this kind of mistake can become.
>>>>>>>>>> - I'd love to use the SQL specification itself, but nobody actually implements it in its entirety and it has far too many places where things are "implementation-defined" [9].
>>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 >>>>>>>>>> [2] >>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4 >>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html >>>>>>>>>> [4] https://github.com/linkedin/coral >>>>>>>>>> [5] >>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000 >>>>>>>>>> [6] https://github.com/google/zetasql >>>>>>>>>> [7] >>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md >>>>>>>>>> [8] >>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4 >>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440 >>>>>>>>>> >>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Did not notice that we are also discussing cross-engine >>>>>>>>>>> interoperability here, I will add my response in the design doc >>>>>>>>>>> here. >>>>>>>>>>> >>>>>>>>>>> I would personally prefer cross-engine interoperability as a >>>>>>>>>>> goal and get the spec in the right structure in the initial release, >>>>>>>>>>> because: >>>>>>>>>>> >>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical >>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that >>>>>>>>>>> already >>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this >>>>>>>>>>> view >>>>>>>>>>> feature. I can do what is now done for Trino to use views with >>>>>>>>>>> Iceberg. I >>>>>>>>>>> can also just use a table property to indicate the table is a view >>>>>>>>>>> and >>>>>>>>>>> store the view SQL as a table property and do my own thing in any >>>>>>>>>>> query >>>>>>>>>>> engine to support all the view features. 
One of the most valuable and unique features that an Iceberg view can unlock is to allow a view to be created in one engine and read by another. Not supporting cross-engine compatibility feels like losing a lot of value to me.
>>>>>>>>>>>
>>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we have "schema" as an Iceberg-native schema, but the "sql" field as view SQL that can come from any query engine. If the engine already needs to convert the view schema to an Iceberg schema, it should just do the same for the view SQL.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the way to achieve it, I think it comes down to either Apache Calcite (or some other third-party alternative I don't know of) or our own implementation of some intermediate representation. I don't have a very strong opinion, but my thoughts are the following:
>>>>>>>>>>>
>>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this kind of issue, but my personal concern is that the integration is definitely going to be much more involved, and it will become another barrier for newer engines to onboard, because an engine not only needs to implement the Iceberg APIs but also needs Calcite support. It will also start to become a constant discussion around what we maintain and what we should push to Calcite, similar to our situation today with Spark.
>>>>>>>>>>>
>>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to have a native lightweight logical query structure representation of the view SQL and store that instead of the SQL string.
We already deal with Expressions in Iceberg, and engines have to convert predicates to Iceberg expressions for predicate pushdown, so I think it would not be hard to extend that to support this use case. Different engines can build this logical structure when traversing their own AST during a create-view query.
>>>>>>>>>>>
>>>>>>>>>>> 3. With these considerations, I think the "sql" field can potentially be a map (maybe called "engine-sqls"?), where the key is the engine type and version, like "Spark 3.1", and the value is the view SQL string. In this way, the engine that creates the view can still read the SQL directly, which might lead to better engine-native integration and avoid redundant parsing. But in this approach there is always a default intermediate representation to fall back to when the engine's key is not found in the map. If we want to make incremental progress and delay the design of the intermediate representation, I think we should at least use this map instead of just a single string.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> First of all, thank you for this discussion and all the view-related work!
>>>>>>>>>>>>
>>>>>>>>>>>> I agree that solving the cross-engine compatibility problem may not be a primary feature today, but I am concerned that not thinking about this from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>>> Cross-engine compatible views would be such a cool feature that it is hard to just let it pass.
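[Editor's sketch: Jack's "engine-sqls" map with a fallback representation, as described earlier in the thread, could behave roughly like this. The key format and the "default" key are assumptions made purely for illustration, not part of any proposal.]

```python
# Sketch of the proposed "engine-sqls" map: view SQL keyed by engine
# type and version, plus a shared fallback representation. The key
# format and the "default" convention are illustrative assumptions.
engine_sqls = {
    "Spark 3.1": "SELECT id, ts FROM db.events WHERE ts > date_sub(current_date(), 7)",
    "Trino 358": "SELECT id, ts FROM db.events WHERE ts > current_date - INTERVAL '7' DAY",
    "default": "SELECT id, ts FROM db.events WHERE ts > CURRENT_DATE - INTERVAL '7' DAY",
}

def sql_for(engine_key):
    # Prefer the engine's own SQL to avoid redundant parsing; otherwise
    # fall back to the default (eventually, an intermediate
    # representation could live in that slot instead of SQL).
    return engine_sqls.get(engine_key, engine_sqls["default"])

assert sql_for("Spark 3.1") == engine_sqls["Spark 3.1"]
assert sql_for("Flink 1.13") == engine_sqls["default"]  # unknown engine falls back
```

The engine that created the view reads its own entry verbatim; any other engine hits the fallback, which is where a future IR could slot in without breaking the map layout.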
>>>>>>>>>>>>
>>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not being familiar enough with Calcite.
>>>>>>>>>>>> However, the new IR being focused on a compatible representation and not being tied to anything existing are actually good things.
>>>>>>>>>>>> For example, we need to focus on the JSON representation, but we don't need to deal with tree traversal or anything, so the code for this could be pretty simple.
>>>>>>>>>>>>
>>>>>>>>>>>> > Allow only ANSI-compliant SQL and anything that is truly common across engines in the view definition (this is how currently Netflix uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>
>>>>>>>>>>>> It's interesting. Anjali, do you have means to enforce that, or is this just a convention?
>>>>>>>>>>>>
>>>>>>>>>>>> What are the common building blocks (relational operations, constructs and functions) that you found sufficient for expressing your views?
>>>>>>>>>>>> Being able to enumerate them could help validate the various approaches considered here, including the feasibility of a dedicated representation.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> PF
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your first list once the spec has been approved by the community. My hope is that the final version of the view spec will make it easy to re-use existing rollback/time travel/metadata etc. functionalities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding SQL dialects, my personal opinion is: enforcing ANSI-compliant SQL across all engines is hard and probably not desirable, while storing Calcite ASTs makes it hard for e.g. Python to use views. A project to make a cross-language and cross-engine IR for SQL views and the relevant transpilers is imho outside the scope of this spec and probably deserving of an Apache project of its own. A smaller IR like Piotr suggested is possible, but I feel it will likely quickly snowball into a larger project and slow down adoption of the view spec in Iceberg. So I think the most reasonable way forward is to add a dialect field and a warning to engines that views are not (yet) cross-compatible. This is at odds with the original spirit of Iceberg tables and I wonder how the broader community feels about it? I would hope that we can make the view spec engine-free over time and eventually deprecate the dialect field.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS if anyone is interested in collaborating on engine-agnostic views, please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have and will continue to address them in the doc.
>>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and >>>>>>>>>>>>>> is in use at >>>>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we >>>>>>>>>>>>>> will rebase >>>>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and >>>>>>>>>>>>>> other feedback from the community and should be able to make >>>>>>>>>>>>>> this MVP >>>>>>>>>>>>>> implementation available quickly as a PR. >>>>>>>>>>>>>> >>>>>>>>>>>>>> A few areas that we have not yet worked on and would love for >>>>>>>>>>>>>> the community to help are: >>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a >>>>>>>>>>>>>> version or time >>>>>>>>>>>>>> 2. History table: A system table implementation for $versions >>>>>>>>>>>>>> similar to the $snapshots table in order to display the history >>>>>>>>>>>>>> of a view >>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a >>>>>>>>>>>>>> previous version >>>>>>>>>>>>>> 4. Engine agnostic SQL: more below. >>>>>>>>>>>>>> >>>>>>>>>>>>>> One comment that is worth a broader discussion is the dialect >>>>>>>>>>>>>> of the SQL stored in the view metadata. The purpose of the spec >>>>>>>>>>>>>> is to >>>>>>>>>>>>>> provide a storage format for view metadata and APIs to access >>>>>>>>>>>>>> that >>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal >>>>>>>>>>>>>> question and is >>>>>>>>>>>>>> outside the scope of this spec. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few >>>>>>>>>>>>>> suggestions that came up in the comments to continue the >>>>>>>>>>>>>> discussion: >>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly >>>>>>>>>>>>>> common across engines in the view definition (this is how >>>>>>>>>>>>>> currently Netflix >>>>>>>>>>>>>> uses these 'common' views across Spark and Trino) >>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect >>>>>>>>>>>>>> of the SQL. 
This allows for any desired dialect, but no improved >>>>>>>>>>>>>> cross-engine operability >>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and >>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST >>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What >>>>>>>>>>>>>> additional functionality does it provide over Calcite?) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Given that the view metadata is json, it is easily extendable >>>>>>>>>>>>>> to incorporate any new fields needed to make the SQL truly >>>>>>>>>>>>>> compatible >>>>>>>>>>>>>> across engines. >>>>>>>>>>>>>> >>>>>>>>>>>>>> What do you think? >>>>>>>>>>>>>> >>>>>>>>>>>>>> regards, >>>>>>>>>>>>>> Anjali >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen < >>>>>>>>>>>>>> pi...@starburstdata.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support. >>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540 >>>>>>>>>>>>>>> Of course, this is by no means usable by other query engines. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Anjali, your document does not talk much about compatibility >>>>>>>>>>>>>>> between query engines. >>>>>>>>>>>>>>> How do you plan to address that? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its >>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive. >>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning >>>>>>>>>>>>>>> from a query engine to a better one. >>>>>>>>>>>>>>> However, I would not base a design of some new system for >>>>>>>>>>>>>>> storing cross-engine compatible views on that. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Is there something else we can use? >>>>>>>>>>>>>>> Maybe the view definition should use some >>>>>>>>>>>>>>> intermediate structured language that's not SQL? 
>>>>>>>>>>>>>>> For example, it could represent the logical structure of operations in a semantic manner.
>>>>>>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing and analysis.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately have to miss the community meetup tomorrow but would love to chat more/help w/ implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for storing view metadata in Iceberg. The proposal has been implemented [1] and has been in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Anjali.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>