Hi,

Thank you for all the comments. I will try to address them all here
together.


   - @all Cross-engine compatibility of view definitions: Multiple options,
   such as engine-agnostic SQL or an IR of some form, have been mentioned. We
   can all agree that all of these options are non-trivial to design/implement
   (perhaps a multi-year effort, depending on the option chosen) and merit
   further discussion. I would like to suggest that we continue this
   discussion but target this work for the future (v2?). In v1, we can add an
   optional dialect field and an optional expanded/resolved SQL field that
   can be interpreted by engines as they see fit. V1 can unlock many use
   cases where views are accessed by a single engine, as well as multi-engine
   use cases where a (common) subset of SQL is supported. This proposal
   allows for desirable features such as versioning of views and a common
   format for storing view metadata, while allowing extensibility in the
   future. *Does anyone feel strongly otherwise?*
   - @Piotr As for common views at Netflix, the restrictions on SQL are
   not enforced, but are advised as best practices. The convention of common
   SQL has been working for a majority of users. SQL features commonly used
   are column projections, simple filter application, joins, grouping, and
   common aggregate and scalar functions. A few users occasionally would
   like to use Trino- or Spark-specific functions, but are sometimes able to
   find a way to use a function that is common to both engines.
   - @Jacques and @Jack Iceberg data types are engine agnostic and hence
   were picked for storing the view schema. Thinking further, the schema
   field should be made optional, since not all engines require it (e.g.,
   Spark does not need it and Trino uses it only for validation).
   - @Jacques Table references in views can be arbitrary objects, such as
   tables from other catalogs, Elasticsearch tables, etc. I will clarify
   this in the spec.
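To make the v1 proposal above concrete, a version entry in the view metadata could look roughly like the sketch below. Field names such as "dialect" and "expanded-sql" are illustrative only, not the final spec:

```python
# Illustrative sketch only: the field names ("dialect", "expanded-sql")
# are placeholders, not the final spec.
view_version = {
    "version-id": 1,
    "schema-id": 0,  # optional: not all engines require the schema
    "dialect": "spark",  # optional: dialect the view was written in
    "sql": "SELECT id, name FROM db.events",
    "expanded-sql": "SELECT id, name FROM prod_catalog.db.events",  # optional
}

def sql_for_engine(version: dict, engine_dialect: str) -> str:
    """Interpret the metadata as an engine sees fit: prefer the original
    SQL when the dialect matches, else fall back to the expanded form."""
    if version.get("dialect") == engine_dialect:
        return version["sql"]
    return version.get("expanded-sql", version["sql"])

print(sql_for_engine(view_version, "trino"))
# → SELECT id, name FROM prod_catalog.db.events
```

An engine whose dialect matches can read the original SQL directly; other engines can fall back to the expanded/resolved form, or ignore both fields.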

I will work on incorporating all the comments in the spec and make the next
revision available for review soon.

Regards,
Anjali.




On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> Thanks Jack and Jacques for sharing your thoughts.
>
> I agree that tracking dialect/origin is better than nothing.
> I think having a Map {dialect: sql} is not going to buy us much.
> It would be useful only if there were some external app (or a human being)
> that would write those alternative SQLs for each dialect.
> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or
> Trino writing SQL for Trino and Spark.
>
> Thanks Jacques for a good summary of SQL supporting options.
> While I like the idea of starting with the Trino SQL ANTLR grammar file
> (it's really well written and resembles the spec quite well), you made a
> good point that grammar is only part of the problem. Coercions, function
> resolution, dereference resolution, and table resolution are parts of query
> analysis that go beyond just grammar.
> In fact, column scoping rules -- while clearly defined by the spec -- may
> easily differ between engines (pretty usual).
> That's why I would rather lean towards some intermediate representation
> that is *not* SQL, and doesn't require parsing (it is already structural)
> or analysis (no scopes! no implicit coercions!).
> Before we embark on such a journey, it would be interesting to hear @Martin
> Traverso <mar...@starburstdata.com> 's thoughts on feasibility though.
>
>
> Best,
> PF
>
>
>
>
> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <jacquesnad...@gmail.com>
> wrote:
>
>> Some thoughts...
>>
>>    - In general, many engines want (or may require) a resolved SQL
>>    field. This--at minimum--typically includes star expansion, since
>>    traditional view behavior is that stars are expanded at view creation
>>    time (this is the only way to guarantee that the view returns the same
>>    logical definition even if the underlying table changes). This may also
>>    include replacement of relative object names with absolute object names
>>    based on the session catalog & namespace. If I recall correctly, Hive
>>    does both of these things.
>>    - It isn't clear in the spec whether the table references used in
>>    views are restricted to other Iceberg objects or can be arbitrary objects
>>    in the context of a particular engine. Maybe I missed this? For example,
>>    can I have a Trino engine view that references an Elasticsearch table
>>    stored in an Iceberg view?
>>    - Restricting schemas to the Iceberg types will likely lead to
>>    unintended consequences. I appreciate the attraction to it but I think it
>>    would either create artificial barriers around the types of SQL that are
>>    allowed and/or mean that replacing a CTE with a view could potentially
>>    change the behavior of the query which I believe violates most typical
>>    engine behaviors. A good example of this is the simple SQL statement
>>    "SELECT c1, 'foo' as c2 FROM table1". In many engines (and Calcite by
>>    default, I believe), c2 will be specified as a CHAR(3). In this Iceberg
>>    context, is this view disallowed? If it isn't disallowed then you have an
>>    issue where the view schema will be required to be different from a CTE
>>    since the engine will resolve it differently than Iceberg. Even if you
>>    ignore CHAR(X), you've still got VARCHAR(X) to contend with...
>>    - It is important to remember that Calcite is a set of libraries and
>>    not a specification. There are things that can be specified in Calcite but
>>    in general it doesn't have formal specification as a first principle. It 
>> is
>>    more implementation as a first principle. This is in contrast to projects
>>    like Arrow and Iceberg, which start with well-formed specifications. I've
>>    been working with Calcite since before it was an Apache project and I
>>    wouldn't recommend adopting it as any form of a specification. On the
>>    flipside, I am very supportive of using it for a reference implementation
>>    standard for Iceberg view consumption, manipulation, etc.  If anything, 
>> I'd
>>    suggest we start with the adoption of a relatively clear grammar, e.g. the
>>    Antlr grammar file that Spark [1] and/or Trino [2] use. Even that is not a
>>    complete specification as grammar must still be interpreted with regards 
>> to
>>    type promotion, function resolution, consistent unnamed expression naming,
>>    etc that aren't defined at the grammar level. I'd definitely avoid using
>>    Calcite's JavaCC grammar as it heavily embeds implementation details (in a
>>    good way) and relies on some fairly complex logic in the validator and
>>    sql2rel components to be fully resolved/comprehended.
>>
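To illustrate the first bullet, here is a toy sketch of "freezing" a view definition at creation time; all names, the mini-catalog, and the helper are invented for illustration only:

```python
# Toy sketch only: a real engine resolves via its parser and catalog,
# not string replacement. The catalog contents here are invented.
TABLE_COLUMNS = {"prod.db.events": ["id", "name", "ts"]}

def resolve_view_sql(sql: str, catalog: str, namespace: str) -> str:
    """Expand '*' and qualify the relative table name at CREATE VIEW time."""
    # Qualify the (single, toy) relative reference "events".
    fq = f"{catalog}.{namespace}.events"
    sql = sql.replace("FROM events", f"FROM {fq}")
    # Expand the star against the schema captured at creation time.
    cols = ", ".join(TABLE_COLUMNS[fq])
    return sql.replace("SELECT *", f"SELECT {cols}")

print(resolve_view_sql("SELECT * FROM events", "prod", "db"))
# → SELECT id, name, ts FROM prod.db.events
```

After resolution, a later ALTER TABLE adding a column no longer changes what the view returns, which is the guarantee described above.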
>> Given the above, I suggest having a field which describes the
>> dialect(origin?) of the view and then each engine can decide how they want
>> to consume/mutate that view (and whether they want to or not). It does risk
>> being a dumping ground. Nonetheless, I'd expect the alternative of
>> establishing a formal SQL specification to be a similarly long process to
>> the couple of years it took to build the Arrow and Iceberg specifications.
>> (Realistically, there is far more to specify here than there is in either
>> of those two domains.)
>>
>> Some other notes:
>>
>>    - Calcite does provide a nice reference document [3] but it is not
>>    sufficient to implement what is necessary for parsing/validating/resolving
>>    a SQL string correctly/consistently.
>>    - Projects like Coral [4] are interesting here but even Coral is
>>    based roughly on "HiveQL" which also doesn't have a formal specification
>>    process outside of the Hive version you're running. See this thread in
>>    Coral slack [5]
>>    - ZetaSQL [6] also seems interesting in this space. It feels closer
>>    to specification based [7] than Calcite but is much less popular in the 
>> big
>>    data domain. I also haven't reviewed its SQL completeness closely, a
>>    strength of Calcite.
>>    - One of the other problems with building against an implementation
>>    as opposed to a specification (e.g. Calcite) is it can make it difficult 
>> or
>>    near impossible to implement the same algorithms again without a bunch of
>>    reverse engineering. If interested in an example of this, see the
>>    discussion behind LZ4 deprecation on the Parquet spec [8] for how painful
>>    this kind of mistake can become.
>>    - I'd love to use the SQL specification itself but nobody actually
>>    implements that in its entirety and it has far too many places where 
>> things
>>    are "implementation-defined" [9].
>>
>> [1]
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>> [2]
>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4
>> [3] https://calcite.apache.org/docs/reference.html
>> [4] https://github.com/linkedin/coral
>> [5] https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000
>> [6] https://github.com/google/zetasql
>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md
>> [8]
>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4
>> [9] https://twitter.com/sc13ts/status/1413728808830525440
>>
>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> I did not notice that we are also discussing cross-engine
>>> interoperability here, so I will add my response from the design doc here.
>>>
>>> I would personally prefer cross-engine interoperability as a goal and
>>> get the spec in the right structure in the initial release, because:
>>>
>>> 1. I believe that cross-engine compatibility is a critical feature of
>>> Iceberg. If I am a user of an existing data lake that already supports
>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. I
>>> can do what is now done for Trino to use views with Iceberg. I can also
>>> just use a table property to indicate the table is a view and store the
>>> view SQL as a table property and do my own thing in any query engine to
>>> support all the view features. One of the most valuable and unique features
>>> that Iceberg view can unlock is to allow a view to be created in one engine
>>> and read by another. Not supporting cross-engine compatibility feels like
>>> losing a lot of value to me.
>>>
>>> 2. In the view definition, it feels inconsistent to me that we have
>>> "schema" as an Iceberg-native schema, but the "sql" field as view SQL
>>> that can come from any query engine. If the engine already needs to
>>> convert the view schema to an Iceberg schema, it should just do the same
>>> for the view SQL.
>>>
>>> Regarding the way to achieve it, I think it comes to either Apache
>>> Calcite (or some other third party alternative I don't know) or our own
>>> implementation of some intermediate representation. I don't have a very
>>> strong opinion, but my thoughts are the following:
>>>
>>> 1. Calcite is supposed to be the go-to software to deal with this kind
>>> of issue, but my personal concern is that the integration is definitely
>>> going to be much more involved, and it will become another barrier for
>>> newer engines to onboard because it not only needs to implement Iceberg
>>> APIs but also needs Calcite support. It will also start to become a
>>> constant discussion around what we maintain and what we should push to
>>> Calcite, similar to our situation today with Spark.
>>>
>>> 2. Another way I am leaning towards, as Piotr also suggested, is to have
>>> a native lightweight logical query structure representation of the view SQL
>>> and store that instead of the SQL string. We already deal with Expressions
>>> in Iceberg, and engines have to convert predicates to Iceberg expressions
>>> for predicate pushdown. So I think it would not be hard to extend on that
>>> to support this use case. Different engines can build this logical
>>> structure when traversing their own AST during a create view query.
>>>
>>> 3. With these considerations, I think the "sql" field can potentially be
>>> a map (maybe called "engine-sqls"?), where key is the engine type and
>>> version like "Spark 3.1", and value is the view SQL string. In this way,
>>> the engine that creates the view can still read the SQL directly which
>>> might lead to better engine-native integration and avoid redundant parsing.
>>> But in this approach there is always a default intermediate representation
>>> to fall back to when the engine's key is not found in the map. If we
>>> want to make incremental progress and delay the design for the intermediate
>>> representation, I think we should at least use this map instead of just a
>>> single string.
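As a rough sketch of that lookup (the "engine-sqls" name, keys, and the default entry are placeholders, not an agreed format):

```python
# Placeholder sketch: "engine-sqls" and its keys are not an agreed format.
view_definition = {
    "engine-sqls": {
        "spark-3.1": "SELECT id, count(*) AS n FROM db.t GROUP BY id",
        "trino-358": "SELECT id, count(*) AS n FROM db.t GROUP BY id",
        # The default entry would hold the intermediate representation
        # that every engine can fall back to when its own key is missing.
        "default": "<intermediate-representation>",
    }
}

def sql_for(view: dict, engine_key: str) -> str:
    """Return the engine's own SQL when present, else the default IR."""
    sqls = view["engine-sqls"]
    return sqls.get(engine_key, sqls["default"])
```

The engine that created the view reads its own entry without re-parsing; any other engine falls back to the default representation.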
>>>
>>> Thanks,
>>> Jack Ye
>>>
>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> First of all thank you for this discussion and all the view-related
>>>> work!
>>>>
>>>> I agree that solving the cross-engine compatibility problem may not be a
>>>> primary feature today, but I am concerned that not thinking about this
>>>> from the start may "tunnel" us into a wrong direction.
>>>> Cross-engine compatible views would be such a cool feature that it is
>>>> hard to just let it pass.
>>>>
>>>> My thinking about a smaller IR may be a side-product of me not being
>>>> familiar enough with Calcite.
>>>> However, the new IR being focused on a compatible representation, and
>>>> not being tied to anything, are actually good things.
>>>> For example, we need to focus on the JSON representation, but we don't
>>>> need to deal with tree traversal or anything, so the code for this could
>>>> be pretty simple.
>>>>
>>>> >  Allow only ANSI-compliant SQL and anything that is truly common
>>>> across engines in the view definition (this is how currently Netflix uses
>>>> these 'common' views across Spark and Trino)
>>>>
>>>> It's interesting. Anjali, do you have means to enforce that, or is
>>>> this just a convention?
>>>>
>>>> What are the common building blocks (relational operations, constructs
>>>> and functions) that you found sufficient for expressing your views?
>>>> Being able to enumerate them could help validate various approaches
>>>> considered here, including feasibility of dedicated representation.
>>>>
>>>>
>>>> Best,
>>>> PF
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com> wrote:
>>>>
>>>>> Hey Anjali,
>>>>>
>>>>> I am definitely happy to help with implementing 1-3 in your first list
>>>>> once the spec has been approved by the community. My hope is that the 
>>>>> final
>>>>> version of the view spec will make it easy to re-use existing 
>>>>> rollback/time
>>>>> travel/metadata etc functionalities.
>>>>>
>>>>> Regarding SQL dialects, my personal opinion is: enforcing
>>>>> ANSI-compliant SQL across all engines is hard and probably not
>>>>> desirable, while storing Calcite makes it hard for e.g. Python to use
>>>>> views. A project to make a cross-language and cross-engine IR for SQL
>>>>> views and the relevant transpilers is imho outside the scope of this
>>>>> spec and probably deserving of an Apache project of its own. A smaller
>>>>> IR like Piotr suggested is possible, but I feel it will likely quickly
>>>>> snowball into a larger project and slow down adoption of the view spec
>>>>> in Iceberg. So I think the most reasonable way forward is to add a
>>>>> dialect field and a warning to engines that views are not (yet) cross
>>>>> compatible. This is at odds with the original spirit of Iceberg tables
>>>>> and I wonder how the broader community feels about it? I would hope
>>>>> that we can make the view spec engine-free over time and eventually
>>>>> deprecate the dialect field.
>>>>>
>>>>> Best,
>>>>> Ryan
>>>>>
>>>>> PS if anyone is interested in collaborating on engine agnostic views
>>>>> please reach out. I am keen on exploring this topic.
>>>>>
>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood
>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have and
>>>>>> will continue to address them in the doc.
>>>>>> Great to know about Trino views, Piotr!
>>>>>>
>>>>>> Thanks to everybody who has offered help with implementation. The
>>>>>> spec as it is proposed in the doc has been implemented and is in use at
>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will 
>>>>>> rebase
>>>>>> our code to Iceberg-0.12 and incorporate changes to format and
>>>>>> other feedback from the community and should be able to make this MVP
>>>>>> implementation available quickly as a PR.
>>>>>>
>>>>>> A few areas that we have not yet worked on and would love for the
>>>>>> community to help are:
>>>>>> 1. Time travel on views: Be able to access the view as of a version
>>>>>> or time
>>>>>> 2. History table: A system table implementation for $versions similar
>>>>>> to the $snapshots table in order to display the history of a view
>>>>>> 3. Rollback to a version: A way to rollback a view to a previous
>>>>>> version
>>>>>> 4. Engine agnostic SQL: more below.
>>>>>>
>>>>>> One comment that is worth a broader discussion is the dialect of the
>>>>>> SQL stored in the view metadata. The purpose of the spec is to provide a
>>>>>> storage format for view metadata and APIs to access that metadata. The
>>>>>> dialect of the SQL stored is an orthogonal question and is outside the
>>>>>> scope of this spec.
>>>>>>
>>>>>> Nonetheless, it is an important concern, so compiling a few
>>>>>> suggestions that came up in the comments to continue the discussion:
>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly common
>>>>>> across engines in the view definition (this is how currently Netflix uses
>>>>>> these 'common' views across Spark and Trino)
>>>>>> 2. Add a field to the view metadata to identify the dialect of the
>>>>>> SQL. This allows for any desired dialect, but no improved cross-engine
>>>>>> operability
>>>>>> 3. Store AST produced by Calcite in the view metadata and translate
>>>>>> back and forth between engine-supported SQL and AST
>>>>>> 4. Intermediate structured language of our own. (What additional
>>>>>> functionality does it provide over Calcite?)
>>>>>>
>>>>>> Given that the view metadata is json, it is easily extendable to
>>>>>> incorporate any new fields needed to make the SQL truly compatible across
>>>>>> engines.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> regards,
>>>>>> Anjali
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen <
>>>>>> pi...@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> FWIW, in Trino we just added Trino views support.
>>>>>>> https://github.com/trinodb/trino/pull/8540
>>>>>>> Of course, this is by no means usable by other query engines.
>>>>>>>
>>>>>>> Anjali, your document does not talk much about compatibility between
>>>>>>> query engines.
>>>>>>> How do you plan to address that?
>>>>>>>
>>>>>>> For example, I am familiar with Coral, and I appreciate its powers
>>>>>>> for dealing with legacy stuff like views defined by Hive.
>>>>>>> I treat it as a great technology supporting transitioning from a
>>>>>>> query engine to a better one.
>>>>>>> However, I would not base a design of some new system for storing
>>>>>>> cross-engine compatible views on that.
>>>>>>>
>>>>>>> Is there something else we can use?
>>>>>>> Maybe the view definition should use some intermediate structured
>>>>>>> language that's not SQL?
>>>>>>> For example, it could represent the logical structure of operations
>>>>>>> in a semantic manner.
>>>>>>> This would eliminate the need for cross-engine compatible parsing and
>>>>>>> analysis.
>>>>>>>
>>>>>>> Best
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Anjali!
>>>>>>>>
>>>>>>>> I have left some comments on the document. I unfortunately have to
>>>>>>>> miss the community meetup tomorrow but would love to chat more/help w/
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood
>>>>>>>> <anorw...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> John Zhuge and I would like to propose the following spec for
>>>>>>>>> storing view metadata in Iceberg. The proposal has been implemented 
>>>>>>>>> [1] and
>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>
>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anjali.
>>>>>>>>>
>>>>>>>>
