Re: Proposal: Support for views in Iceberg

Jacques Nadeau Thu, 26 Aug 2021 14:36:44 -0700

The original proposal was more focused on executable plans (as opposed to
the logical plans). However, I'm arguing that it should include both
logical and physical since the two are fairly heavily overlapping. (See how
much logic is in the base classes for most Calcite operators as opposed to
the concrete logical or physical variations.) If other people agree with my
proposal that Arrow IR should be able to express both logical and physical,
then my hope was it could potentially serve the purposes discussed here for
an engine-agnostic version of a view.


+Phillip Cloud <[email protected]> here as he is actively working on the
Arrow IR stuff (i've also suggested he subscribe to iceberg dev).



On Thu, Aug 26, 2021 at 1:51 PM Micah Kornfield <[email protected]>
wrote:

> I'll point a link to this thread on dev@arrow so people not subscribed to
> both can way in?  Should we maybe start a new thread about IR?
>
>
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit
> is the document that Wes wrote up that started the conversation.
>
> On Thu, Aug 26, 2021 at 1:42 PM Ryan Blue <[email protected]> wrote:
>
>> Micah or someone more familiar (Jacques?), could you summarize some of
>> the details of the Arrow IR proposal? What are the intended use cases and
>> goals? We'd want to see whether the goals align with our own.
>>
>> On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Small tangent:
>>>
>>> Regarding an IR, Apache Arrow is looking at this now [1] with several
>>> RFC proposals.  It would be nice to coordinate, at least to make sure
>>> translation from one IR to another isn't too onerous (in an ideal world we
>>> could maybe share the same one).  Either way external feedback would be
>>> useful.
>>>
>>> Cheers,
>>> Micah
>>>
>>>
>>> [1]
>>> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=ejd0qpl9lygynq2jagxo+lfer...@mail.gmail.com%3e
>>>
>>> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <[email protected]> wrote:
>>>
>>>> I think that the current proposal is looking good, but it is always a
>>>> good idea to give people a few days to review it and bring up any issues or
>>>> further discussion on the topics in this thread.
>>>>
>>>> I'll also add this to the next sync agenda so we can farm for dissent
>>>> next week. Sometimes you can get a better read on what is still concerning
>>>> in person.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello All,
>>>>> I answered all the comments and made changes to the spec to reflect
>>>>> them .. the doc now shows the new revision (link here:
>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit#
>>>>> ).
>>>>> Please diff with the version named 'Spec v1' in order to see the
>>>>> deltas. Please let me know if any of your comments are not
>>>>> satisfactorily addressed.
>>>>>
>>>>> @Ryan, Thank you for the insightful feedback, especially around the
>>>>> use of Iceberg data types for schema and possible evolution.
>>>>> I agree with your comment: *"I think we should move forward with the
>>>>> proposal as it is, and pursue type annotations and possibly IR in
>>>>> parallel."*
>>>>> How do we achieve consensus/farm for dissent and move forward with the
>>>>> proposal?
>>>>>
>>>>> thanks,
>>>>> Anjali.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <[email protected]> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for working on this, Anjali! It’s great to see the thorough
>>>>>>> discussion here with everyone.
>>>>>>>
>>>>>>> For the discussion about SQL dialect, I think that the right first
>>>>>>> step is to capture the SQL or query dialect. That will give us the most
>>>>>>> flexibility. Engines that can use Coral for translation can attempt to
>>>>>>> convert and engines that don’t can see if the SQL is valid and can be 
>>>>>>> used.
>>>>>>>
>>>>>>> I think that the idea to create a minimal IR is an interesting one,
>>>>>>> but can be added later. We will always need to record the SQL and 
>>>>>>> dialect,
>>>>>>> even if we translate to IR because users are going to configure views 
>>>>>>> using
>>>>>>> SQL. Uses like showing view history or debugging need to show the 
>>>>>>> original
>>>>>>> SQL, plus relevant information like where it was created and the SQL
>>>>>>> dialect. We should be able to add this later by adding additional 
>>>>>>> metadata
>>>>>>> to the view definition. I don’t think that it would introduce breaking
>>>>>>> changes to add a common representation that can be optionally consumed.
>>>>>>>
>>>>>>> Let’s continue talking about a minimal IR, separately. View
>>>>>>> translation is a hard problem. Right now, to get views across engines we
>>>>>>> have to compromise confidence. IR is a way to have strong confidence, 
>>>>>>> but
>>>>>>> with limited expressibility. I think that’s a good trade in a lot of 
>>>>>>> cases
>>>>>>> and is worth pursuing, even if it will take a long time.
>>>>>>>
>>>>>>> Jacques makes a great point about types, but I think that the right
>>>>>>> option here is to continue using Iceberg types. We’ve already had
>>>>>>> discussions about whether Iceberg should support annotating types with
>>>>>>> engine-specific ones, so we have a reasonable way to improve this while
>>>>>>> also providing compatibility across engines: char(n) is not
>>>>>>> necessarily supported everywhere and mapping it to string will make
>>>>>>> sense in most places. The schema is primarily used to validate that the
>>>>>>> data produced by the query hasn’t changed and that is more about the 
>>>>>>> number
>>>>>>> of columns in structs and the names of fields rather than exact types. 
>>>>>>> We
>>>>>>> can fix up types when substituting without losing too much: if the SQL
>>>>>>> produces a varchar(10) field that the view metadata says is a string,
>>>>>>> then it’s okay that it is varchar(10). There is some loss in that
>>>>>>> we don’t know if it was originally varchar(5), but I think that
>>>>>>> this is not going to cause too many issues. Not all engines will even
>>>>>>> validate that the schema has not changed, since it could be valid to 
>>>>>>> use select
>>>>>>> * from x where ... and allow new fields to appear.
>>>>>>>
>>>>>>> Right now, I think we should move forward with the proposal as it
>>>>>>> is, and pursue type annotations and possibly IR in parallel. Does that
>>>>>>> sound reasonable to everyone?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Anjali,
>>>>>>>>
>>>>>>>> That's a nice summary.
>>>>>>>>
>>>>>>>> re dialect field. It shouldn't be a bit trouble to have it (or any
>>>>>>>> other way to identify application that created the view), and it might 
>>>>>>>> be
>>>>>>>> useful.
>>>>>>>> Why not make it required from the start?
>>>>>>>>
>>>>>>>> re "expanded/resolved SQL" -- i don't understand yet what we would
>>>>>>>> put there, so cannot comment.
>>>>>>>>
>>>>>>>> I agree there it's nice to get something out of the door, and I see
>>>>>>>> how the current proposal fits some needs already.
>>>>>>>> However, i am concerned about the proliferation of non-cross-engine
>>>>>>>> compatible views, if we do that.
>>>>>>>>
>>>>>>>> Also, if we later agree on any compatible approach (portable subset
>>>>>>>> of SQL, engine-agnostic IR, etc.), then from the perspective of each
>>>>>>>> engine, it would be a breaking change.
>>>>>>>> Unless we make the compatible approach as expressive as full power
>>>>>>>> of SQL, some views that are possible to create in v1 will not be 
>>>>>>>> possible
>>>>>>>> to create in v2.
>>>>>>>> Thus, if v1  is "some SQL" and v2 is "something awesomely
>>>>>>>> compatible", we may not be able to roll it out.
>>>>>>>>
>>>>>>>> > the convention of common SQL has been working for a majority of
>>>>>>>> users. SQL features commonly used are column projections, simple filter
>>>>>>>> application, joins, grouping and common aggregate and scalar function. 
>>>>>>>> A
>>>>>>>> few users occasionally would like to use Trino or Spark specific 
>>>>>>>> functions
>>>>>>>> but are sometimes able to find a way to use a function that is common 
>>>>>>>> to
>>>>>>>> both the engines.
>>>>>>>>
>>>>>>>>
>>>>>>>> it's an awesome summary of what constructs are necessary to be able
>>>>>>>> to define useful views, while also keep them portable.
>>>>>>>>
>>>>>>>> To be able to express column projections, simple filter
>>>>>>>> application, joins, grouping and common aggregate and scalar function 
>>>>>>>> in a
>>>>>>>> structured IR, how much effort do you think would be required?
>>>>>>>> We didn't really talk about downsides of a structured approach,
>>>>>>>> other than it looks complex.
>>>>>>>> if we indeed estimate it as a multi-year effort, i wouldn't argue
>>>>>>>> for that. Maybe i were overly optimistic though.
>>>>>>>>
>>>>>>>>
>>>>>>>> As Jack mentioned, for engine-specific approach that's not supposed
>>>>>>>> to be consumed by multiple engines, we may be better served with 
>>>>>>>> approach
>>>>>>>> that's outside of Iceberg spec, like
>>>>>>>> https://github.com/trinodb/trino/pull/8540.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thank you for all the comments. I will try to address them all
>>>>>>>>> here together.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - @all Cross engine compatibility of view definition: Multiple
>>>>>>>>>    options such as engine agnostic SQL or IR of some form have been 
>>>>>>>>> mentioned.
>>>>>>>>>    We can all agree that all of these options are non-trivial to
>>>>>>>>>    design/implement (perhaps a multi-year effort based on the option 
>>>>>>>>> chosen)
>>>>>>>>>    and merit further discussion. I would like to suggest that we 
>>>>>>>>> continue this
>>>>>>>>>    discussion but target this work for the future (v2?). In v1, we 
>>>>>>>>> can add an
>>>>>>>>>    optional dialect field and an optional expanded/resolved SQL field 
>>>>>>>>> that can
>>>>>>>>>    be interpreted by engines as they see fit. V1 can unlock many use 
>>>>>>>>> cases
>>>>>>>>>    where the views are either accessed by a single engine or 
>>>>>>>>> multi-engine use
>>>>>>>>>    cases where a (common) subset of SQL is supported. This proposal 
>>>>>>>>> allows for
>>>>>>>>>    desirable features such as versioning of views and a common format 
>>>>>>>>> of
>>>>>>>>>    storing view metadata while allowing extensibility in the future. 
>>>>>>>>> *Does
>>>>>>>>>    anyone feel strongly otherwise?*
>>>>>>>>>    - @Piotr  As for common views at Netflix, the restrictions on
>>>>>>>>>    SQL are not enforced, but are advised as best practices. The 
>>>>>>>>> convention of
>>>>>>>>>    common SQL has been working for a majority of users. SQL features 
>>>>>>>>> commonly
>>>>>>>>>    used are column projections, simple filter application, joins, 
>>>>>>>>> grouping and
>>>>>>>>>    common aggregate and scalar function. A few users occasionally 
>>>>>>>>> would like
>>>>>>>>>    to use Trino or Spark specific functions but are sometimes able to 
>>>>>>>>> find a
>>>>>>>>>    way to use a function that is common to both the engines.
>>>>>>>>>    - @Jacques and @Jack Iceberg data types are engine agnostic
>>>>>>>>>    and hence were picked for storing view schema. Thinking further, 
>>>>>>>>> the schema
>>>>>>>>>    field should be made 'optional', since not all engines require it. 
>>>>>>>>> (e.g.
>>>>>>>>>    Spark does not need it and Trino uses it only for validation).
>>>>>>>>>    - @Jacques Table references in the views can be arbitrary
>>>>>>>>>    objects such as tables from other catalogs or elasticsearch tables 
>>>>>>>>> etc. I
>>>>>>>>>    will clarify it in the spec.
>>>>>>>>>
>>>>>>>>> I will work on incorporating all the comments in the spec and make
>>>>>>>>> the next revision available for review soon.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Anjali.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Thanks Jack and Jacques for sharing your thoughts.
>>>>>>>>>>
>>>>>>>>>> I agree that tracking  dialect/origin is better than nothing.
>>>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much.
>>>>>>>>>> I.e. it would be useful if there was some external app (or a
>>>>>>>>>> human being) that would write those alternative SQLs for each 
>>>>>>>>>> dialect.
>>>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and
>>>>>>>>>> Trino, or Trino writing SQL for Trino and Spark.
>>>>>>>>>>
>>>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>>>> While i like the idea of starting with Trino SQL ANTLR grammar
>>>>>>>>>> file (it's really well written and resembles spec quite well), you 
>>>>>>>>>> made a
>>>>>>>>>> good point that grammar is only part of the problem. Coercions, 
>>>>>>>>>> function
>>>>>>>>>> resolution, dereference resolution, table resolution are part of 
>>>>>>>>>> query
>>>>>>>>>> analysis that goes beyond just grammar.
>>>>>>>>>> In fact, column scoping rules -- while clearly defined by the
>>>>>>>>>> spec -- may easily differ between engines (pretty usual).
>>>>>>>>>> That's why i would rather lean towards some intermediate
>>>>>>>>>> representation that is *not *SQL, doesn't require parsing (is
>>>>>>>>>> already structural), nor analysis (no scopes! no implicit 
>>>>>>>>>> coercions!).
>>>>>>>>>> Before we embark on such a journey, it would be interesting to
>>>>>>>>>> hear @Martin Traverso <[email protected]> 's thoughts on
>>>>>>>>>> feasibility though.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Some thoughts...
>>>>>>>>>>>
>>>>>>>>>>>    - In general, many engines want (or may require) a resolved
>>>>>>>>>>>    sql field. This--at minimum--typically includes star expansion 
>>>>>>>>>>> since
>>>>>>>>>>>    traditional view behavior is stars are expanded at view creation 
>>>>>>>>>>> time
>>>>>>>>>>>    (since this is the only way to guarantee that the view returns 
>>>>>>>>>>> the same
>>>>>>>>>>>    logical definition even if the underlying table changes). This 
>>>>>>>>>>> may also
>>>>>>>>>>>    include a replacement of relative object names to absolute 
>>>>>>>>>>> object names
>>>>>>>>>>>    based on the session catalog & namespace. If I recall correctly, 
>>>>>>>>>>> Hive does
>>>>>>>>>>>    both of these things.
>>>>>>>>>>>    - It isn't clear in the spec whether the table references
>>>>>>>>>>>    used in views are restricted to other Iceberg objects or can be 
>>>>>>>>>>> arbitrary
>>>>>>>>>>>    objects in the context of a particular engine. Maybe I missed 
>>>>>>>>>>> this? For
>>>>>>>>>>>    example, can I have a Trino engine view that references an 
>>>>>>>>>>> Elasticsearch
>>>>>>>>>>>    table stored in an Iceberg view?
>>>>>>>>>>>    - Restricting schemas to the Iceberg types will likely lead
>>>>>>>>>>>    to unintended consequences. I appreciate the attraction to it 
>>>>>>>>>>> but I think
>>>>>>>>>>>    it would either create artificial barriers around the types of 
>>>>>>>>>>> SQL that are
>>>>>>>>>>>    allowed and/or mean that replacing a CTE with a view could 
>>>>>>>>>>> potentially
>>>>>>>>>>>    change the behavior of the query which I believe violates most 
>>>>>>>>>>> typical
>>>>>>>>>>>    engine behaviors. A good example of this is the simple sql 
>>>>>>>>>>> statement of
>>>>>>>>>>>    "SELECT c1, 'foo' as c2 from table1". In many engines (and 
>>>>>>>>>>> Calcite by
>>>>>>>>>>>    default I believe), c2 will be specified as a CHAR(3). In this 
>>>>>>>>>>> Iceberg
>>>>>>>>>>>    context, is this view disallowed? If it isn't disallowed then 
>>>>>>>>>>> you have an
>>>>>>>>>>>    issue where the view schema will be required to be different 
>>>>>>>>>>> from a CTE
>>>>>>>>>>>    since the engine will resolve it differently than Iceberg. Even 
>>>>>>>>>>> if you
>>>>>>>>>>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>>>>>>>>>>    - It is important to remember that Calcite is a set of
>>>>>>>>>>>    libraries and not a specification. There are things that can be 
>>>>>>>>>>> specified
>>>>>>>>>>>    in Calcite but in general it doesn't have formal specification 
>>>>>>>>>>> as a first
>>>>>>>>>>>    principle. It is more implementation as a first principle. This 
>>>>>>>>>>> is in
>>>>>>>>>>>    contrast to projects like Arrow and Iceberg, which start with 
>>>>>>>>>>> well-formed
>>>>>>>>>>>    specifications. I've been working with Calcite since before it 
>>>>>>>>>>> was an
>>>>>>>>>>>    Apache project and I wouldn't recommend adopting it as any form 
>>>>>>>>>>> of a
>>>>>>>>>>>    specification. On the flipside, I am very supportive of using it 
>>>>>>>>>>> for a
>>>>>>>>>>>    reference implementation standard for Iceberg view consumption,
>>>>>>>>>>>    manipulation, etc.  If anything, I'd suggest we start with the 
>>>>>>>>>>> adoption of
>>>>>>>>>>>    a relatively clear grammar, e.g. the Antlr grammar file that 
>>>>>>>>>>> Spark [1]
>>>>>>>>>>>    and/or Trino [2] use. Even that is not a complete specification 
>>>>>>>>>>> as grammar
>>>>>>>>>>>    must still be interpreted with regards to type promotion, 
>>>>>>>>>>> function
>>>>>>>>>>>    resolution, consistent unnamed expression naming, etc that 
>>>>>>>>>>> aren't defined
>>>>>>>>>>>    at the grammar level. I'd definitely avoid using Calcite's 
>>>>>>>>>>> JavaCC grammar
>>>>>>>>>>>    as it heavily embeds implementation details (in a good way) and 
>>>>>>>>>>> relies on
>>>>>>>>>>>    some fairly complex logic in the validator and sql2rel 
>>>>>>>>>>> components to be
>>>>>>>>>>>    fully resolved/comprehended.
>>>>>>>>>>>
>>>>>>>>>>> Given the above, I suggest having a field which describes the
>>>>>>>>>>> dialect(origin?) of the view and then each engine can decide how 
>>>>>>>>>>> they want
>>>>>>>>>>> to consume/mutate that view (and whether they want to or not). It 
>>>>>>>>>>> does risk
>>>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of
>>>>>>>>>>> establishing a formal SQL specification to be a similarly long 
>>>>>>>>>>> process to
>>>>>>>>>>> the couple of years it took to build the Arrow and Iceberg 
>>>>>>>>>>> specifications.
>>>>>>>>>>> (Realistically, there is far more to specify here than there is in 
>>>>>>>>>>> either
>>>>>>>>>>> of those two domains.)
>>>>>>>>>>>
>>>>>>>>>>> Some other notes:
>>>>>>>>>>>
>>>>>>>>>>>    - Calcite does provide a nice reference document [3] but it
>>>>>>>>>>>    is not sufficient to implement what is necessary for
>>>>>>>>>>>    parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>>>    - Projects like Coral [4] are interesting here but even
>>>>>>>>>>>    Coral is based roughly on "HiveQL" which also doesn't have a 
>>>>>>>>>>> formal
>>>>>>>>>>>    specification process outside of the Hive version you're 
>>>>>>>>>>> running. See this
>>>>>>>>>>>    thread in Coral slack [5]
>>>>>>>>>>>    - ZetaSQL [6] also seems interesting in this space. It feels
>>>>>>>>>>>    closer to specification based [7] than Calcite but is much less 
>>>>>>>>>>> popular in
>>>>>>>>>>>    the big data domain. I also haven't reviewed it's SQL 
>>>>>>>>>>> completeness closely,
>>>>>>>>>>>    a strength of Calcite.
>>>>>>>>>>>    - One of the other problems with building against an
>>>>>>>>>>>    implementation as opposed to a specification (e.g. Calcite) is 
>>>>>>>>>>> it can make
>>>>>>>>>>>    it difficult or near impossible to implement the same algorithms 
>>>>>>>>>>> again
>>>>>>>>>>>    without a bunch of reverse engineering. If interested in an 
>>>>>>>>>>> example of
>>>>>>>>>>>    this, see the discussion behind LZ4 deprecation on the Parquet 
>>>>>>>>>>> spec [8] for
>>>>>>>>>>>    how painful this kind of mistake can become.
>>>>>>>>>>>    - I'd love to use the SQL specification itself but nobody
>>>>>>>>>>>    actually implements that in its entirety and it has far too many 
>>>>>>>>>>> places
>>>>>>>>>>>    where things are "implementation-defined" [9].
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>> [2]
>>>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>>>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html
>>>>>>>>>>> [4] https://github.com/linkedin/coral
>>>>>>>>>>> [5]
>>>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>>>>>>>>>>> [6] https://github.com/google/zetasql
>>>>>>>>>>> [7]
>>>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md
>>>>>>>>>>> [8]
>>>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>>>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Did not notice that we are also discussing cross-engine
>>>>>>>>>>>> interoperability here, I will add my response in the design doc 
>>>>>>>>>>>> here.
>>>>>>>>>>>>
>>>>>>>>>>>> I would personally prefer cross-engine interoperability as a
>>>>>>>>>>>> goal and get the spec in the right structure in the initial 
>>>>>>>>>>>> release,
>>>>>>>>>>>> because:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical
>>>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that 
>>>>>>>>>>>> already
>>>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this 
>>>>>>>>>>>> view
>>>>>>>>>>>> feature. I can do what is now done for Trino to use views with 
>>>>>>>>>>>> Iceberg. I
>>>>>>>>>>>> can also just use a table property to indicate the table is a view 
>>>>>>>>>>>> and
>>>>>>>>>>>> store the view SQL as a table property and do my own thing in any 
>>>>>>>>>>>> query
>>>>>>>>>>>> engine to support all the view features. One of the most valuable 
>>>>>>>>>>>> and
>>>>>>>>>>>> unique features that Iceberg view can unlock is to allow a view to 
>>>>>>>>>>>> be
>>>>>>>>>>>> created in one engine and read by another. Not supporting 
>>>>>>>>>>>> cross-engine
>>>>>>>>>>>> compatibility feels like losing a lot of value to me.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we
>>>>>>>>>>>> have "schema" as an Iceberg native schema, but "sql" field as the 
>>>>>>>>>>>> view SQL
>>>>>>>>>>>> that can come from any query engine. If the engine already needs 
>>>>>>>>>>>> to convert
>>>>>>>>>>>> the view schema to iceberg shema, it should just do the same for 
>>>>>>>>>>>> the view
>>>>>>>>>>>> SQL.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the way to achieve it, I think it comes to either
>>>>>>>>>>>> Apache Calcite (or some other third party alternative I don't 
>>>>>>>>>>>> know) or our
>>>>>>>>>>>> own implementation of some intermediate representation. I don't 
>>>>>>>>>>>> have a very
>>>>>>>>>>>> strong opinion, but my thoughts are the following:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with
>>>>>>>>>>>> this kind of issue, but my personal concern is that the 
>>>>>>>>>>>> integration is
>>>>>>>>>>>> definitely going to be much more involved, and it will become 
>>>>>>>>>>>> another
>>>>>>>>>>>> barrier for newer engines to onboard because it not only needs to 
>>>>>>>>>>>> implement
>>>>>>>>>>>> Iceberg APIs but also needs Calcite support. It will also start to 
>>>>>>>>>>>> become a
>>>>>>>>>>>> constant discussion around what we maintain and what we should 
>>>>>>>>>>>> push to
>>>>>>>>>>>> Calcite, similar to our situation today with Spark.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested,
>>>>>>>>>>>> is to have a native lightweight logical query structure 
>>>>>>>>>>>> representation of
>>>>>>>>>>>> the view SQL and store that instead of the SQL string. We already 
>>>>>>>>>>>> deal with
>>>>>>>>>>>> Expressions in Iceberg, and engines have to convert predicates to 
>>>>>>>>>>>> Iceberg
>>>>>>>>>>>> expressions for predicate pushdown. So I think it would not be 
>>>>>>>>>>>> hard to
>>>>>>>>>>>> extend on that to support this use case. Different engines can 
>>>>>>>>>>>> build this
>>>>>>>>>>>> logical structure when traversing their own AST during a create 
>>>>>>>>>>>> view query.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. With these considerations, I think the "sql" field can
>>>>>>>>>>>> potentially be a map (maybe called "engine-sqls"?), where key is 
>>>>>>>>>>>> the engine
>>>>>>>>>>>> type and version like "Spark 3.1", and value is the view SQL 
>>>>>>>>>>>> string. In
>>>>>>>>>>>> this way, the engine that creates the view can still read the SQL 
>>>>>>>>>>>> directly
>>>>>>>>>>>> which might lead to better engine-native integration and avoid 
>>>>>>>>>>>> redundant
>>>>>>>>>>>> parsing. But in this approach there is always a default 
>>>>>>>>>>>> intermediate
>>>>>>>>>>>> representation it can fallback to when the engine's key is not 
>>>>>>>>>>>> found in the
>>>>>>>>>>>> map. If we want to make incremental progress and delay the design 
>>>>>>>>>>>> for the
>>>>>>>>>>>> intermediate representation, I think we should at least use this 
>>>>>>>>>>>> map
>>>>>>>>>>>> instead of just a single string.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all thank you for this discussion and all the
>>>>>>>>>>>>> view-related work!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that solving cross-engine compatibility problem may
>>>>>>>>>>>>> not be primary feature today, I am concerned that not thinking 
>>>>>>>>>>>>> about this
>>>>>>>>>>>>> from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>>>> Cross-engine compatible views would be such a cool feature
>>>>>>>>>>>>> that it is hard to just let it pass.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not
>>>>>>>>>>>>> being familiar enough with Calcite.
>>>>>>>>>>>>> However, with new IR being focused on compatible
>>>>>>>>>>>>> representation, and not being tied to anything are actually good 
>>>>>>>>>>>>> things.
>>>>>>>>>>>>> For example, we need to focus on JSON representation, but we
>>>>>>>>>>>>> don't need to deal with tree traversal or anything, so the code 
>>>>>>>>>>>>> for this
>>>>>>>>>>>>> could be pretty simple.
>>>>>>>>>>>>>
>>>>>>>>>>>>> >  Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>> common across engines in the view definition (this is how 
>>>>>>>>>>>>> currently Netflix
>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>
>>>>>>>>>>>>> it's interesting. Anjali, do  you have means to enforce that,
>>>>>>>>>>>>> or is this just a convention?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What are the common building blocks (relational operations,
>>>>>>>>>>>>> constructs and functions) that you found sufficient for 
>>>>>>>>>>>>> expressing your
>>>>>>>>>>>>> views?
>>>>>>>>>>>>> Being able to enumerate them could help validate various
>>>>>>>>>>>>> approaches considered here, including feasibility of dedicated
>>>>>>>>>>>>> representation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> PF
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your
>>>>>>>>>>>>>> first list once the spec has been approved by the community. My 
>>>>>>>>>>>>>> hope is
>>>>>>>>>>>>>> that the final version of the view spec will make it easy to
>>>>>>>>>>>>>> re-use existing rollback/time travel/metadata etc 
>>>>>>>>>>>>>> functionalities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding SQL dialects.My personal opinion is: Enforcing
>>>>>>>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>>>>>>>>>>> desirable while storing Calcite makes it hard for eg python to 
>>>>>>>>>>>>>> use views. A
>>>>>>>>>>>>>> project to make a cross language and cross engine IR for sql 
>>>>>>>>>>>>>> views and the
>>>>>>>>>>>>>> relevant transpilers is imho outside the scope of this spec and 
>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>> deserving an apache project of its own. A smaller IR like Piotr 
>>>>>>>>>>>>>> suggested
>>>>>>>>>>>>>> is possible but I feel it will likely quickly snowball into a 
>>>>>>>>>>>>>> larger
>>>>>>>>>>>>>> project and slow down adoption of the view spec in iceberg. So I 
>>>>>>>>>>>>>> think the
>>>>>>>>>>>>>> most reasonable way forward is to add a dialect field and a 
>>>>>>>>>>>>>> warning to
>>>>>>>>>>>>>> engines that views are not (yet) cross compatible. This is at 
>>>>>>>>>>>>>> odds with the
>>>>>>>>>>>>>> original spirit of iceberg tables and I wonder how the border 
>>>>>>>>>>>>>> community
>>>>>>>>>>>>>> feels about it? I would hope that we can make the view spec 
>>>>>>>>>>>>>> engine-free
>>>>>>>>>>>>>> over time and eventually deprecate the dialect field.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PS if anyone is interested in collaborating on engine
>>>>>>>>>>>>>> agnostic views please reach out. I am keen on exploring this 
>>>>>>>>>>>>>> topic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I
>>>>>>>>>>>>>>> have and will continue to address them in the doc.
>>>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks to everybody who has offered help with
>>>>>>>>>>>>>>> implementation. The spec as it is proposed in the doc has been 
>>>>>>>>>>>>>>> implemented
>>>>>>>>>>>>>>> and is in use at Netflix (currently on Iceberg 0.9). Once we 
>>>>>>>>>>>>>>> close the
>>>>>>>>>>>>>>> spec, we will rebase our code to Iceberg-0.12 and incorporate 
>>>>>>>>>>>>>>> changes to
>>>>>>>>>>>>>>> format and other feedback from the community and should be able 
>>>>>>>>>>>>>>> to make
>>>>>>>>>>>>>>> this MVP implementation available quickly as a PR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A few areas that we have not yet worked on and would love
>>>>>>>>>>>>>>> for the community to help are:
>>>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a
>>>>>>>>>>>>>>> version or time
>>>>>>>>>>>>>>> 2. History table: A system table implementation for
>>>>>>>>>>>>>>> $versions similar to the $snapshots table in order to display 
>>>>>>>>>>>>>>> the history
>>>>>>>>>>>>>>> of a view
>>>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a
>>>>>>>>>>>>>>> previous version
>>>>>>>>>>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One comment that is worth a broader discussion is the
>>>>>>>>>>>>>>> dialect of the SQL stored in the view metadata. The purpose of 
>>>>>>>>>>>>>>> the spec is
>>>>>>>>>>>>>>> to provide a storage format for view metadata and APIs to 
>>>>>>>>>>>>>>> access that
>>>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal 
>>>>>>>>>>>>>>> question and is
>>>>>>>>>>>>>>> outside the scope of this spec.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>>>>>>>>>>> suggestions that came up in the comments to continue the 
>>>>>>>>>>>>>>> discussion:
>>>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly
>>>>>>>>>>>>>>> common across engines in the view definition (this is how 
>>>>>>>>>>>>>>> currently Netflix
>>>>>>>>>>>>>>> uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect
>>>>>>>>>>>>>>> of the SQL. This allows for any desired dialect, but no improved
>>>>>>>>>>>>>>> cross-engine operability
>>>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and
>>>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST
>>>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What
>>>>>>>>>>>>>>> additional functionality does it provide over Calcite?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given that the view metadata is json, it is easily
>>>>>>>>>>>>>>> extendable to incorporate any new fields needed to make the SQL 
>>>>>>>>>>>>>>> truly
>>>>>>>>>>>>>>> compatible across engines.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>> Anjali
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>>>>>>>>>>> Of course, this is by no means usable by other query
>>>>>>>>>>>>>>>> engines.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Anjali, your document does not talk much about
>>>>>>>>>>>>>>>> compatibility between query engines.
>>>>>>>>>>>>>>>> How do you plan to address that?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its
>>>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by 
>>>>>>>>>>>>>>>> Hive.
>>>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning
>>>>>>>>>>>>>>>> from a query engine to a better one.
>>>>>>>>>>>>>>>> However, I would not base a design of some new system for
>>>>>>>>>>>>>>>> storing cross-engine compatible views on that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there something else we can use?
>>>>>>>>>>>>>>>> Maybe the view definition should use some
>>>>>>>>>>>>>>>> intermediate structured language that's not SQL?
>>>>>>>>>>>>>>>> For example, it could represent logical structure of
>>>>>>>>>>>>>>>> operations in semantics manner.
>>>>>>>>>>>>>>>> This would eliminate need for cross-engine compatible
>>>>>>>>>>>>>>>> parsing and analysis.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately
>>>>>>>>>>>>>>>>> have to miss the community meetup tomorrow but would love to 
>>>>>>>>>>>>>>>>> chat more/help
>>>>>>>>>>>>>>>>> w/ implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec
>>>>>>>>>>>>>>>>>> for storing view metadata in Iceberg. The proposal has been 
>>>>>>>>>>>>>>>>>> implemented [1]
>>>>>>>>>>>>>>>>>> and is in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to
>>>>>>>>>>>>>>>>>> the doc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Anjali.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: Proposal: Support for views in Iceberg

Reply via email to