Hello all, I have addressed all the comments and updated the spec accordingly; the doc now shows the new revision (link here: https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit# ). Please diff against the version named 'Spec v1' to see the deltas. Please let me know if any of your comments are not satisfactorily addressed.
@Ryan, Thank you for the insightful feedback, especially around the use of Iceberg data types for schema and possible evolution. I agree with your comment: *"I think we should move forward with the proposal as it is, and pursue type annotations and possibly IR in parallel."* How do we achieve consensus/farm for dissent and move forward with the proposal? Thanks, Anjali. On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jzh...@apache.org> wrote: > +1 > > On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <b...@tabular.io> wrote: > >> Thanks for working on this, Anjali! It’s great to see the thorough >> discussion here with everyone. >> >> For the discussion about SQL dialect, I think that the right first step >> is to capture the SQL or query dialect. That will give us the most >> flexibility. Engines that can use Coral for translation can attempt to >> convert, and engines that don’t can see if the SQL is valid and can be used. >> >> I think that the idea to create a minimal IR is an interesting one, but >> it can be added later. We will always need to record the SQL and dialect, even >> if we translate to IR, because users are going to configure views using SQL. >> Use cases like showing view history or debugging need to show the original SQL, >> plus relevant information like where it was created and the SQL dialect. We >> should be able to add this later by adding additional metadata to the view >> definition. I don’t think that it would introduce breaking changes to add a >> common representation that can be optionally consumed. >> >> Let’s continue talking about a minimal IR separately. View translation >> is a hard problem. Right now, to get views across engines we have to >> compromise confidence. IR is a way to have strong confidence, but with >> limited expressibility. I think that’s a good trade in a lot of cases and >> is worth pursuing, even if it will take a long time.
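[Editor's note] To make the "capture the SQL and dialect" idea above concrete, here is a minimal sketch of what a view representation entry could look like. All field names are illustrative assumptions, not taken from the proposed spec:

```python
import json

# Hypothetical sketch of a view representation entry: the original SQL is
# always recorded along with the dialect it was written in, so an engine
# can decide whether it can consume the text directly.
view_definition = {
    "type": "sql",
    "sql": "SELECT id, name FROM db.events WHERE ds = '2021-07-20'",
    "dialect": "spark",          # engine/dialect that created the view
    "default-catalog": "prod",   # context needed to re-resolve table names
    "default-namespace": ["db"],
}

def can_consume(view: dict, engine_dialect: str) -> bool:
    """An engine consumes the SQL directly only if the dialect matches;
    otherwise it may attempt translation (e.g. via Coral) or reject."""
    return view["dialect"] == engine_dialect

# The metadata is plain JSON, so it round-trips trivially.
serialized = json.dumps(view_definition)
```

Because the definition is JSON, later additions (a common IR, type annotations) could be extra optional fields without breaking existing readers.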
>> >> Jacques makes a great point about types, but I think that the right >> option here is to continue using Iceberg types. We’ve already had >> discussions about whether Iceberg should support annotating types with >> engine-specific ones, so we have a reasonable way to improve this while >> also providing compatibility across engines: char(n) is not necessarily >> supported everywhere and mapping it to string will make sense in most >> places. The schema is primarily used to validate that the data produced by >> the query hasn’t changed and that is more about the number of columns in >> structs and the names of fields rather than exact types. We can fix up >> types when substituting without losing too much: if the SQL produces a >> varchar(10) field that the view metadata says is a string, then it’s >> okay that it is varchar(10). There is some loss in that we don’t know if >> it was originally varchar(5), but I think that this is not going to >> cause too many issues. Not all engines will even validate that the schema >> has not changed, since it could be valid to use select * from x where ... >> and allow new fields to appear. >> >> Right now, I think we should move forward with the proposal as it is, and >> pursue type annotations and possibly IR in parallel. Does that sound >> reasonable to everyone? >> >> Ryan >> >> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen <pi...@starburstdata.com> >> wrote: >> >>> Hi Anjali, >>> >>> That's a nice summary. >>> >>> re dialect field. It shouldn't be much trouble to have it (or any other >>> way to identify the application that created the view), and it might be useful. >>> Why not make it required from the start? >>> >>> re "expanded/resolved SQL" -- I don't understand yet what we would put >>> there, so cannot comment. >>> >>> I agree that it's nice to get something out of the door, and I see how >>> the current proposal fits some needs already.
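[Editor's note] The type fix-up Ryan describes, relaxing engine character types to the Iceberg string type and validating by field names and count rather than exact types, can be sketched roughly as follows (function names and the (name, type)-pair schema encoding are illustrative assumptions):

```python
import re

def normalize_type(engine_type: str) -> str:
    """Map engine-specific character types to the Iceberg string type.
    Length information (varchar(5) vs varchar(10)) is deliberately dropped."""
    if re.fullmatch(r"(var)?char\(\d+\)", engine_type.lower()):
        return "string"
    return engine_type.lower()

def schemas_compatible(view_schema, query_schema) -> bool:
    """Validate by field count and names, with types compared only after
    normalization. Schemas are lists of (name, type) pairs."""
    if len(view_schema) != len(query_schema):
        return False
    return all(
        vn == qn and normalize_type(vt) == normalize_type(qt)
        for (vn, vt), (qn, qt) in zip(view_schema, query_schema)
    )
```

Under this check, a query now producing varchar(10) still matches a stored string column, while a renamed or dropped column is caught.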
>>> However, I am concerned about the proliferation of non-cross-engine >>> compatible views, if we do that. >>> >>> Also, if we later agree on any compatible approach (portable subset of >>> SQL, engine-agnostic IR, etc.), then from the perspective of each engine, >>> it would be a breaking change. >>> Unless we make the compatible approach as expressive as the full power of >>> SQL, some views that are possible to create in v1 will not be possible to >>> create in v2. >>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", >>> we may not be able to roll it out. >>> >>> > the convention of common SQL has been working for a majority of users. >>> SQL features commonly used are column projections, simple filter >>> application, joins, grouping and common aggregate and scalar function. A >>> few users occasionally would like to use Trino or Spark specific functions >>> but are sometimes able to find a way to use a function that is common to >>> both the engines. >>> >>> >>> It's an awesome summary of what constructs are necessary to be able to >>> define useful views, while also keeping them portable. >>> >>> To be able to express column projections, simple filter application, >>> joins, grouping and common aggregate and scalar functions in a structured >>> IR, how much effort do you think would be required? >>> We didn't really talk about the downsides of a structured approach, other >>> than that it looks complex. >>> If we indeed estimate it as a multi-year effort, I wouldn't argue for >>> that. Maybe I was overly optimistic, though. >>> >>> >>> As Jack mentioned, for an engine-specific approach that's not supposed to >>> be consumed by multiple engines, we may be better served with an approach >>> that's outside of the Iceberg spec, like >>> https://github.com/trinodb/trino/pull/8540.
>>> >>> >>> Best, >>> PF >>> >>> >>> >>> >>> >>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood >>> <anorw...@netflix.com.invalid> wrote: >>> >>>> Hi, >>>> >>>> Thank you for all the comments. I will try to address them all here >>>> together. >>>> >>>> >>>> - @all Cross engine compatibility of view definition: Multiple >>>> options such as engine agnostic SQL or IR of some form have been >>>> mentioned. >>>> We can all agree that all of these options are non-trivial to >>>> design/implement (perhaps a multi-year effort based on the option >>>> chosen) >>>> and merit further discussion. I would like to suggest that we continue >>>> this >>>> discussion but target this work for the future (v2?). In v1, we can add >>>> an >>>> optional dialect field and an optional expanded/resolved SQL field that >>>> can >>>> be interpreted by engines as they see fit. V1 can unlock many use cases >>>> where the views are either accessed by a single engine or multi-engine >>>> use >>>> cases where a (common) subset of SQL is supported. This proposal allows >>>> for >>>> desirable features such as versioning of views and a common format of >>>> storing view metadata while allowing extensibility in the future. *Does >>>> anyone feel strongly otherwise?* >>>> - @Piotr As for common views at Netflix, the restrictions on SQL >>>> are not enforced, but are advised as best practices. The convention of >>>> common SQL has been working for a majority of users. SQL features >>>> commonly >>>> used are column projections, simple filter application, joins, grouping >>>> and >>>> common aggregate and scalar function. A few users occasionally would >>>> like >>>> to use Trino or Spark specific functions but are sometimes able to find >>>> a >>>> way to use a function that is common to both the engines. >>>> - @Jacques and @Jack Iceberg data types are engine agnostic and >>>> hence were picked for storing view schema. 
Thinking further, the schema >>>> field should be made 'optional', since not all engines require it. (e.g. >>>> Spark does not need it and Trino uses it only for validation). >>>> - @Jacques Table references in the views can be arbitrary objects >>>> such as tables from other catalogs or elasticsearch tables etc. I will >>>> clarify it in the spec. >>>> >>>> I will work on incorporating all the comments in the spec and make the >>>> next revision available for review soon. >>>> >>>> Regards, >>>> Anjali. >>>> >>>> >>>> >>>> >>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen < >>>> pi...@starburstdata.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> Thanks Jack and Jacques for sharing your thoughts. >>>>> >>>>> I agree that tracking dialect/origin is better than nothing. >>>>> I think having a Map {dialect: sql} is not going to buy us much. >>>>> I.e. it would be useful if there was some external app (or a human >>>>> being) that would write those alternative SQLs for each dialect. >>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or >>>>> Trino writing SQL for Trino and Spark. >>>>> >>>>> Thanks Jacques for a good summary of SQL supporting options. >>>>> While I like the idea of starting with the Trino SQL ANTLR grammar file >>>>> (it's really well written and resembles the spec quite well), you made a good >>>>> point that grammar is only part of the problem. Coercions, function >>>>> resolution, dereference resolution, table resolution are part of query >>>>> analysis that goes beyond just grammar. >>>>> In fact, column scoping rules -- while clearly defined by the spec -- >>>>> may easily differ between engines (pretty usual). >>>>> That's why I would rather lean towards some intermediate >>>>> representation that is *not* SQL, doesn't require parsing (is already >>>>> structural), nor analysis (no scopes! no implicit coercions!).
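[Editor's note] As a rough illustration of the kind of intermediate representation Piotr describes, already structural (no parsing) and already analyzed (no scopes, no implicit coercions), consider this toy sketch. The node shapes are invented for illustration only; a real IR would need its own specification:

```python
import json

# A toy engine-agnostic IR for "SELECT id, amount FROM prod.db.orders
# WHERE region = 'EU'", stored as a plain JSON-serializable tree.
view_ir = {
    "op": "project",
    "columns": [{"ref": "id"}, {"ref": "amount"}],
    "input": {
        "op": "filter",
        "predicate": {"fn": "eq", "args": [{"ref": "region"}, {"lit": "EU"}]},
        "input": {"op": "scan", "table": ["prod", "db", "orders"]},
    },
}

def tables_referenced(node: dict):
    """Walk the IR tree and collect every scanned table -- the kind of
    analysis that needs no SQL parser once the view is structural."""
    if node["op"] == "scan":
        return [tuple(node["table"])]
    return tables_referenced(node["input"])

# JSON round-trip shows the representation carries no engine state.
roundtrip = json.loads(json.dumps(view_ir))
```

An engine would build such a tree while traversing its own AST at CREATE VIEW time, and reconstruct its native plan from it when reading.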
>>>>> Before we embark on such a journey, it would be interesting to hear >>>>> @Martin >>>>> Traverso <mar...@starburstdata.com> 's thoughts on feasibility though. >>>>> >>>>> >>>>> Best, >>>>> PF >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau < >>>>> jacquesnad...@gmail.com> wrote: >>>>> >>>>>> Some thoughts... >>>>>> >>>>>> - In general, many engines want (or may require) a resolved sql >>>>>> field. This--at minimum--typically includes star expansion since >>>>>> traditional view behavior is stars are expanded at view creation time >>>>>> (since this is the only way to guarantee that the view returns the >>>>>> same >>>>>> logical definition even if the underlying table changes). This may >>>>>> also >>>>>> include a replacement of relative object names to absolute object >>>>>> names >>>>>> based on the session catalog & namespace. If I recall correctly, Hive >>>>>> does >>>>>> both of these things. >>>>>> - It isn't clear in the spec whether the table references used in >>>>>> views are restricted to other Iceberg objects or can be arbitrary >>>>>> objects >>>>>> in the context of a particular engine. Maybe I missed this? For >>>>>> example, >>>>>> can I have a Trino engine view that references an Elasticsearch table >>>>>> stored in an Iceberg view? >>>>>> - Restricting schemas to the Iceberg types will likely lead to >>>>>> unintended consequences. I appreciate the attraction to it but I >>>>>> think it >>>>>> would either create artificial barriers around the types of SQL that >>>>>> are >>>>>> allowed and/or mean that replacing a CTE with a view could potentially >>>>>> change the behavior of the query which I believe violates most typical >>>>>> engine behaviors. A good example of this is the simple sql statement >>>>>> of >>>>>> "SELECT c1, 'foo' as c2 from table1". In many engines (and Calcite by >>>>>> default I believe), c2 will be specified as a CHAR(3). In this Iceberg >>>>>> context, is this view disallowed? 
If it isn't disallowed then you >>>>>> have an >>>>>> issue where the view schema will be required to be different from a >>>>>> CTE >>>>>> since the engine will resolve it differently than Iceberg. Even if you >>>>>> ignore CHAR(X), you've still got VARCHAR(X) to contend with... >>>>>> - It is important to remember that Calcite is a set of libraries >>>>>> and not a specification. There are things that can be specified in >>>>>> Calcite >>>>>> but in general it doesn't have formal specification as a first >>>>>> principle. >>>>>> It is more implementation as a first principle. This is in contrast to >>>>>> projects like Arrow and Iceberg, which start with well-formed >>>>>> specifications. I've been working with Calcite since before it was an >>>>>> Apache project and I wouldn't recommend adopting it as any form of a >>>>>> specification. On the flipside, I am very supportive of using it for a >>>>>> reference implementation standard for Iceberg view consumption, >>>>>> manipulation, etc. If anything, I'd suggest we start with the >>>>>> adoption of >>>>>> a relatively clear grammar, e.g. the Antlr grammar file that Spark [1] >>>>>> and/or Trino [2] use. Even that is not a complete specification as >>>>>> grammar >>>>>> must still be interpreted with regards to type promotion, function >>>>>> resolution, consistent unnamed expression naming, etc that aren't >>>>>> defined >>>>>> at the grammar level. I'd definitely avoid using Calcite's JavaCC >>>>>> grammar >>>>>> as it heavily embeds implementation details (in a good way) and >>>>>> relies on >>>>>> some fairly complex logic in the validator and sql2rel components to >>>>>> be >>>>>> fully resolved/comprehended. >>>>>> >>>>>> Given the above, I suggest having a field which describes the >>>>>> dialect(origin?) of the view and then each engine can decide how they >>>>>> want >>>>>> to consume/mutate that view (and whether they want to or not). It does >>>>>> risk >>>>>> being a dumping ground. 
Nonetheless, I'd expect the alternative of >>>>>> establishing a formal SQL specification to be a similarly long process to >>>>>> the couple of years it took to build the Arrow and Iceberg >>>>>> specifications. >>>>>> (Realistically, there is far more to specify here than there is in either >>>>>> of those two domains.) >>>>>> >>>>>> Some other notes: >>>>>> >>>>>> - Calcite does provide a nice reference document [3] but it is >>>>>> not sufficient to implement what is necessary for >>>>>> parsing/validating/resolving a SQL string correctly/consistently. >>>>>> - Projects like Coral [4] are interesting here but even Coral is >>>>>> based roughly on "HiveQL" which also doesn't have a formal >>>>>> specification >>>>>> process outside of the Hive version you're running. See this thread in >>>>>> Coral slack [5]. >>>>>> - ZetaSQL [6] also seems interesting in this space. It feels >>>>>> closer to specification-based [7] than Calcite but is much less >>>>>> popular in >>>>>> the big data domain. I also haven't reviewed its SQL completeness >>>>>> closely, >>>>>> a strength of Calcite. >>>>>> - One of the other problems with building against an >>>>>> implementation as opposed to a specification (e.g. Calcite) is it can >>>>>> make >>>>>> it difficult or near impossible to implement the same algorithms again >>>>>> without a bunch of reverse engineering. If interested in an example of >>>>>> this, see the discussion behind LZ4 deprecation on the Parquet spec >>>>>> [8] for >>>>>> how painful this kind of mistake can become. >>>>>> - I'd love to use the SQL specification itself but nobody >>>>>> actually implements that in its entirety and it has far too many >>>>>> places >>>>>> where things are "implementation-defined" [9].
>>>>>> >>>>>> [1] >>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 >>>>>> [2] >>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4 >>>>>> [3] https://calcite.apache.org/docs/reference.html >>>>>> [4] https://github.com/linkedin/coral >>>>>> [5] >>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000 >>>>>> [6] https://github.com/google/zetasql >>>>>> [7] https://github.com/google/zetasql/blob/master/docs/one-pager.md >>>>>> [8] >>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4 >>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440 >>>>>> >>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>> >>>>>>> Did not notice that we are also discussing cross-engine >>>>>>> interoperability here, I will add my response in the design doc here. >>>>>>> >>>>>>> I would personally prefer cross-engine interoperability as a goal >>>>>>> and get the spec in the right structure in the initial release, because: >>>>>>> >>>>>>> 1. I believe that cross-engine compatibility is a critical feature >>>>>>> of Iceberg. If I am a user of an existing data lake that already >>>>>>> supports >>>>>>> views (e.g. Hive), I don't even need Iceberg to have this view feature. >>>>>>> I >>>>>>> can do what is now done for Trino to use views with Iceberg. I can also >>>>>>> just use a table property to indicate the table is a view and store the >>>>>>> view SQL as a table property and do my own thing in any query engine to >>>>>>> support all the view features. One of the most valuable and unique >>>>>>> features >>>>>>> that Iceberg view can unlock is to allow a view to be created in one >>>>>>> engine >>>>>>> and read by another. Not supporting cross-engine compatibility feels >>>>>>> like >>>>>>> losing a lot of value to me. >>>>>>> >>>>>>> 2. 
In the view definition, it feels inconsistent to me that we have >>>>>>> "schema" as an Iceberg native schema, but the "sql" field as the view SQL >>>>>>> that >>>>>>> can come from any query engine. If the engine already needs to convert >>>>>>> the >>>>>>> view schema to an Iceberg schema, it should just do the same for the view >>>>>>> SQL. >>>>>>> >>>>>>> Regarding the way to achieve it, I think it comes down to either Apache >>>>>>> Calcite (or some other third-party alternative I don't know) or our own >>>>>>> implementation of some intermediate representation. I don't have a very >>>>>>> strong opinion, but my thoughts are the following: >>>>>>> >>>>>>> 1. Calcite is supposed to be the go-to software to deal with this >>>>>>> kind of issue, but my personal concern is that the integration is >>>>>>> definitely going to be much more involved, and it will become another >>>>>>> barrier for newer engines to onboard because it not only needs to >>>>>>> implement >>>>>>> Iceberg APIs but also needs Calcite support. It will also start to >>>>>>> become a >>>>>>> constant discussion around what we maintain and what we should push to >>>>>>> Calcite, similar to our situation today with Spark. >>>>>>> >>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to >>>>>>> have a native lightweight logical query structure representation of the >>>>>>> view SQL and store that instead of the SQL string. We already deal with >>>>>>> Expressions in Iceberg, and engines have to convert predicates to >>>>>>> Iceberg >>>>>>> expressions for predicate pushdown. So I think it would not be hard to >>>>>>> extend on that to support this use case. Different engines can build >>>>>>> this >>>>>>> logical structure when traversing their own AST during a create view >>>>>>> query. >>>>>>> >>>>>>> 3.
With these considerations, I think the "sql" field can >>>>>>> potentially be a map (maybe called "engine-sqls"?), where the key is the >>>>>>> engine >>>>>>> type and version like "Spark 3.1", and the value is the view SQL string. In >>>>>>> this way, the engine that creates the view can still read the SQL >>>>>>> directly, >>>>>>> which might lead to better engine-native integration and avoid redundant >>>>>>> parsing. But in this approach there is always a default intermediate >>>>>>> representation it can fall back to when the engine's key is not found in >>>>>>> the >>>>>>> map. If we want to make incremental progress and delay the design for >>>>>>> the >>>>>>> intermediate representation, I think we should at least use this map >>>>>>> instead of just a single string. >>>>>>> >>>>>>> Thanks, >>>>>>> Jack Ye >>>>>>> >>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen < >>>>>>> pi...@starburstdata.com> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> First of all, thank you for this discussion and all the view-related >>>>>>>> work! >>>>>>>> >>>>>>>> I agree that solving the cross-engine compatibility problem may not be >>>>>>>> a primary feature today, but I am concerned that not thinking about this >>>>>>>> from the >>>>>>>> start may "tunnel" us into a wrong direction. >>>>>>>> Cross-engine compatible views would be such a cool feature that it >>>>>>>> is hard to just let it pass. >>>>>>>> >>>>>>>> My thinking about a smaller IR may be a side-product of me not >>>>>>>> being familiar enough with Calcite. >>>>>>>> However, a new IR being focused on a compatible representation, >>>>>>>> and not being tied to anything, are actually good things. >>>>>>>> For example, we need to focus on the JSON representation, but we don't >>>>>>>> need to deal with tree traversal or anything, so the code for this >>>>>>>> could be >>>>>>>> pretty simple.
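[Editor's note] Jack's "engine-sqls" map with a fallback could look roughly like this; the key format ("spark-3.1", "trino-360") and the "default" fallback key are assumptions for illustration, not from the proposal:

```python
# Sketch of the proposed "engine-sqls" map: the key identifies the engine
# type and version, the value is that engine's SQL text, and a default
# representation is always present to fall back to.
engine_sqls = {
    "spark-3.1": "SELECT /*+ spark hint */ id FROM db.t",
    "trino-360": "SELECT id FROM db.t",
    "default": "SELECT id FROM db.t",
}

def sql_for_engine(sqls: dict, engine_key: str) -> str:
    """Prefer the engine's own SQL (no re-parsing or translation needed);
    otherwise use the shared fallback representation."""
    return sqls.get(engine_key, sqls["default"])
```

The engine that created the view reads its own entry directly; any other engine consumes the default entry, which could later become a structured IR instead of SQL text.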
>>>>>>>> >>>>>>>> > Allow only ANSI-compliant SQL and anything that is truly common >>>>>>>> across engines in the view definition (this is how currently Netflix >>>>>>>> uses >>>>>>>> these 'common' views across Spark and Trino) >>>>>>>> >>>>>>>> It's interesting. Anjali, do you have means to enforce that, or is >>>>>>>> this just a convention? >>>>>>>> >>>>>>>> What are the common building blocks (relational operations, >>>>>>>> constructs and functions) that you found sufficient for expressing your >>>>>>>> views? >>>>>>>> Being able to enumerate them could help validate the various approaches >>>>>>>> considered here, including the feasibility of a dedicated representation. >>>>>>>> >>>>>>>> >>>>>>>> Best, >>>>>>>> PF >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hey Anjali, >>>>>>>>> >>>>>>>>> I am definitely happy to help with implementing 1-3 in your first >>>>>>>>> list once the spec has been approved by the community. My hope is >>>>>>>>> that the >>>>>>>>> final version of the view spec will make it easy to re-use existing >>>>>>>>> rollback/time travel/metadata etc. functionalities. >>>>>>>>> >>>>>>>>> Regarding SQL dialects. My personal opinion is: enforcing >>>>>>>>> ANSI-compliant SQL across all engines is hard and probably not >>>>>>>>> desirable, while storing Calcite makes it hard for e.g. Python to use >>>>>>>>> views. A >>>>>>>>> project to make a cross-language and cross-engine IR for SQL views >>>>>>>>> and the >>>>>>>>> relevant transpilers is imho outside the scope of this spec and >>>>>>>>> probably >>>>>>>>> deserving of an Apache project of its own. A smaller IR like Piotr >>>>>>>>> suggested >>>>>>>>> is possible but I feel it will likely quickly snowball into a larger >>>>>>>>> project and slow down adoption of the view spec in Iceberg.
So I >>>>>>>>> think the >>>>>>>>> most reasonable way forward is to add a dialect field and a warning to >>>>>>>>> engines that views are not (yet) cross-compatible. This is at odds >>>>>>>>> with the >>>>>>>>> original spirit of Iceberg tables and I wonder how the broader >>>>>>>>> community >>>>>>>>> feels about it? I would hope that we can make the view spec >>>>>>>>> engine-free >>>>>>>>> over time and eventually deprecate the dialect field. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Ryan >>>>>>>>> >>>>>>>>> PS: if anyone is interested in collaborating on engine-agnostic >>>>>>>>> views please reach out. I am keen on exploring this topic. >>>>>>>>> >>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood >>>>>>>>> <anorw...@netflix.com.invalid> wrote: >>>>>>>>> >>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have >>>>>>>>>> and will continue to address them in the doc. >>>>>>>>>> Great to know about Trino views, Piotr! >>>>>>>>>> >>>>>>>>>> Thanks to everybody who has offered help with implementation. The >>>>>>>>>> spec as it is proposed in the doc has been implemented and is in use >>>>>>>>>> at >>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we will >>>>>>>>>> rebase >>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and >>>>>>>>>> other feedback from the community and should be able to make this MVP >>>>>>>>>> implementation available quickly as a PR. >>>>>>>>>> >>>>>>>>>> A few areas that we have not yet worked on and would love for the >>>>>>>>>> community to help are: >>>>>>>>>> 1. Time travel on views: Be able to access the view as of a >>>>>>>>>> version or time >>>>>>>>>> 2. History table: A system table implementation for $versions >>>>>>>>>> similar to the $snapshots table in order to display the history of a >>>>>>>>>> view >>>>>>>>>> 3. Rollback to a version: A way to roll back a view to a previous >>>>>>>>>> version >>>>>>>>>> 4. Engine-agnostic SQL: more below.
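[Editor's note] Items 1-3 above could build on a versions list along these lines. This is a hedged sketch: the field names mirror the snapshot terminology of Iceberg tables but are not taken from the view spec:

```python
# Each change to a view appends an immutable version; rollback makes an
# old version current again, and a $versions system table would surface
# the list, similar to $snapshots for tables. Structure is illustrative.
versions = [
    {"version-id": 1, "timestamp-ms": 1626000000000, "sql": "SELECT a FROM t"},
    {"version-id": 2, "timestamp-ms": 1626500000000, "sql": "SELECT a, b FROM t"},
]
current_version_id = 2

def rollback(versions, target_id):
    """Return the version-id to set as current, verifying it exists.
    The old versions are never mutated, so this is metadata-only."""
    if target_id not in [v["version-id"] for v in versions]:
        raise ValueError(f"unknown version {target_id}")
    return target_id

def history(versions):
    """Rows a $versions system table could expose for time travel."""
    return [(v["version-id"], v["timestamp-ms"]) for v in versions]
```

Time travel "as of" a timestamp then reduces to picking the latest version whose timestamp-ms is not after the requested time.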
>>>>>>>>>> >>>>>>>>>> One comment that is worth a broader discussion is the dialect of >>>>>>>>>> the SQL stored in the view metadata. The purpose of the spec is to >>>>>>>>>> provide >>>>>>>>>> a storage format for view metadata and APIs to access that metadata. >>>>>>>>>> The >>>>>>>>>> dialect of the SQL stored is an orthogonal question and is outside >>>>>>>>>> the >>>>>>>>>> scope of this spec. >>>>>>>>>> >>>>>>>>>> Nonetheless, it is an important concern, so compiling a few >>>>>>>>>> suggestions that came up in the comments to continue the discussion: >>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly >>>>>>>>>> common across engines in the view definition (this is how currently >>>>>>>>>> Netflix >>>>>>>>>> uses these 'common' views across Spark and Trino) >>>>>>>>>> 2. Add a field to the view metadata to identify the dialect of >>>>>>>>>> the SQL. This allows for any desired dialect, but no improved >>>>>>>>>> cross-engine >>>>>>>>>> operability >>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and >>>>>>>>>> translate back and forth between engine-supported SQL and AST >>>>>>>>>> 4. Intermediate structured language of our own. (What additional >>>>>>>>>> functionality does it provide over Calcite?) >>>>>>>>>> >>>>>>>>>> Given that the view metadata is json, it is easily extendable to >>>>>>>>>> incorporate any new fields needed to make the SQL truly compatible >>>>>>>>>> across >>>>>>>>>> engines. >>>>>>>>>> >>>>>>>>>> What do you think? >>>>>>>>>> >>>>>>>>>> regards, >>>>>>>>>> Anjali >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen < >>>>>>>>>> pi...@starburstdata.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> FWIW, in Trino we just added Trino views support. >>>>>>>>>>> https://github.com/trinodb/trino/pull/8540 >>>>>>>>>>> Of course, this is by no means usable by other query engines. 
>>>>>>>>>>> >>>>>>>>>>> Anjali, your document does not talk much about compatibility >>>>>>>>>>> between query engines. >>>>>>>>>>> How do you plan to address that? >>>>>>>>>>> >>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its >>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive. >>>>>>>>>>> I treat it as a great technology supporting transitioning from a >>>>>>>>>>> query engine to a better one. >>>>>>>>>>> However, I would not base a design of some new system for >>>>>>>>>>> storing cross-engine compatible views on that. >>>>>>>>>>> >>>>>>>>>>> Is there something else we can use? >>>>>>>>>>> Maybe the view definition should use some >>>>>>>>>>> intermediate structured language that's not SQL? >>>>>>>>>>> For example, it could represent the logical structure of >>>>>>>>>>> operations in a semantic manner. >>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing >>>>>>>>>>> and analysis. >>>>>>>>>>> >>>>>>>>>>> Best >>>>>>>>>>> PF >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks Anjali! >>>>>>>>>>>> >>>>>>>>>>>> I have left some comments on the document. I unfortunately have >>>>>>>>>>>> to miss the community meetup tomorrow but would love to chat >>>>>>>>>>>> more/help w/ >>>>>>>>>>>> implementation. >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Ryan >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood >>>>>>>>>>>> <anorw...@netflix.com.invalid> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hello, >>>>>>>>>>>>> >>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for >>>>>>>>>>>>> storing view metadata in Iceberg. The proposal has been >>>>>>>>>>>>> implemented [1] and >>>>>>>>>>>>> is in production at Netflix for over 15 months.
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing >>>>>>>>>>>>> >>>>>>>>>>>>> [1] >>>>>>>>>>>>> https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view >>>>>>>>>>>>> >>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Anjali. >>>>>>>>>>>>> >>>>>>>>>>>> >> >> -- >> Ryan Blue >> Tabular >> > -- > John Zhuge >