I'll post a link to this thread on dev@arrow so people not subscribed to both can weigh in. Should we maybe start a new thread about IR?
https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit is the document that Wes wrote up that started the conversation. On Thu, Aug 26, 2021 at 1:42 PM Ryan Blue <b...@tabular.io> wrote: > Micah or someone more familiar (Jacques?), could you summarize some of the > details of the Arrow IR proposal? What are the intended use cases and > goals? We'd want to see whether the goals align with our own. > > On Thu, Aug 26, 2021 at 9:24 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Small tangent: >> >> Regarding an IR, Apache Arrow is looking at this now [1] with several RFC >> proposals. It would be nice to coordinate, at least to make sure >> translation from one IR to another isn't too onerous (in an ideal world we >> could maybe share the same one). Either way external feedback would be >> useful. >> >> Cheers, >> Micah >> >> >> [1] >> https://mail-archives.apache.org/mod_mbox/arrow-dev/202108.mbox/%3cCAKa9qDmLq_4dcmFTJJWpp=ejd0qpl9lygynq2jagxo+lfer...@mail.gmail.com%3e >> >> On Thu, Aug 26, 2021 at 8:21 AM Ryan Blue <b...@tabular.io> wrote: >> >>> I think that the current proposal is looking good, but it is always a >>> good idea to give people a few days to review it and bring up any issues or >>> further discussion on the topics in this thread. >>> >>> I'll also add this to the next sync agenda so we can farm for dissent >>> next week. Sometimes you can get a better read on what is still concerning >>> in person. >>> >>> Ryan >>> >>> On Wed, Aug 25, 2021 at 9:46 PM Anjali Norwood >>> <anorw...@netflix.com.invalid> wrote: >>> >>>> Hello All, >>>> I answered all the comments and made changes to the spec to reflect >>>> them .. the doc now shows the new revision (link here: >>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit# >>>> ). >>>> Please diff with the version named 'Spec v1' in order to see the >>>> deltas. 
Please let me know if any of your comments are not >>>> satisfactorily addressed. >>>> >>>> @Ryan, Thank you for the insightful feedback, especially around the use >>>> of Iceberg data types for schema and possible evolution. >>>> I agree with your comment: *"I think we should move forward with the >>>> proposal as it is, and pursue type annotations and possibly IR in >>>> parallel."* >>>> How do we achieve consensus/farm for dissent and move forward with the >>>> proposal? >>>> >>>> thanks, >>>> Anjali. >>>> >>>> >>>> >>>> On Sun, Aug 22, 2021 at 11:12 AM John Zhuge <jzh...@apache.org> wrote: >>>> >>>>> +1 >>>>> >>>>> On Sun, Aug 22, 2021 at 10:02 AM Ryan Blue <b...@tabular.io> wrote: >>>>> >>>>>> Thanks for working on this, Anjali! It’s great to see the thorough >>>>>> discussion here with everyone. >>>>>> >>>>>> For the discussion about SQL dialect, I think that the right first >>>>>> step is to capture the SQL or query dialect. That will give us the most >>>>>> flexibility. Engines that can use Coral for translation can attempt to >>>>>> convert and engines that don’t can see if the SQL is valid and can be >>>>>> used. >>>>>> >>>>>> I think that the idea to create a minimal IR is an interesting one, >>>>>> but can be added later. We will always need to record the SQL and >>>>>> dialect, >>>>>> even if we translate to IR because users are going to configure views >>>>>> using >>>>>> SQL. Uses like showing view history or debugging need to show the >>>>>> original >>>>>> SQL, plus relevant information like where it was created and the SQL >>>>>> dialect. We should be able to add this later by adding additional >>>>>> metadata >>>>>> to the view definition. I don’t think that it would introduce breaking >>>>>> changes to add a common representation that can be optionally consumed. >>>>>> >>>>>> Let’s continue talking about a minimal IR, separately. View >>>>>> translation is a hard problem. 
Right now, to get views across engines we >>>>>> have to compromise confidence. IR is a way to have strong confidence, but >>>>>> with limited expressibility. I think that’s a good trade in a lot of >>>>>> cases >>>>>> and is worth pursuing, even if it will take a long time. >>>>>> >>>>>> Jacques makes a great point about types, but I think that the right >>>>>> option here is to continue using Iceberg types. We’ve already had >>>>>> discussions about whether Iceberg should support annotating types with >>>>>> engine-specific ones, so we have a reasonable way to improve this while >>>>>> also providing compatibility across engines: char(n) is not >>>>>> necessarily supported everywhere and mapping it to string will make >>>>>> sense in most places. The schema is primarily used to validate that the >>>>>> data produced by the query hasn’t changed and that is more about the >>>>>> number >>>>>> of columns in structs and the names of fields rather than exact types. We >>>>>> can fix up types when substituting without losing too much: if the SQL >>>>>> produces a varchar(10) field that the view metadata says is a string, >>>>>> then it’s okay that it is varchar(10). There is some loss in that we >>>>>> don’t know if it was originally varchar(5), but I think that this is >>>>>> not going to cause too many issues. Not all engines will even validate >>>>>> that >>>>>> the schema has not changed, since it could be valid to use select * >>>>>> from x where ... and allow new fields to appear. >>>>>> >>>>>> Right now, I think we should move forward with the proposal as it is, >>>>>> and pursue type annotations and possibly IR in parallel. Does that sound >>>>>> reasonable to everyone? >>>>>> >>>>>> Ryan >>>>>> >>>>>> On Thu, Jul 29, 2021 at 7:50 AM Piotr Findeisen < >>>>>> pi...@starburstdata.com> wrote: >>>>>> >>>>>>> Hi Anjali, >>>>>>> >>>>>>> That's a nice summary. >>>>>>> >>>>>>> re dialect field. 
It shouldn't be much trouble to have it (or any other way to identify the application that created the view), and it might be useful. Why not make it required from the start?
>>>>>>>
>>>>>>> re "expanded/resolved SQL" -- I don't understand yet what we would put there, so I cannot comment.
>>>>>>>
>>>>>>> I agree it's nice to get something out the door, and I see how the current proposal fits some needs already.
>>>>>>> However, I am concerned about the proliferation of non-cross-engine-compatible views if we do that.
>>>>>>>
>>>>>>> Also, if we later agree on any compatible approach (portable subset of SQL, engine-agnostic IR, etc.), then from the perspective of each engine, it would be a breaking change.
>>>>>>> Unless we make the compatible approach as expressive as the full power of SQL, some views that are possible to create in v1 will not be possible to create in v2.
>>>>>>> Thus, if v1 is "some SQL" and v2 is "something awesomely compatible", we may not be able to roll it out.
>>>>>>>
>>>>>>> > the convention of common SQL has been working for a majority of users. SQL features commonly used are column projections, simple filter application, joins, grouping and common aggregate and scalar function. A few users occasionally would like to use Trino or Spark specific functions but are sometimes able to find a way to use a function that is common to both the engines.
>>>>>>>
>>>>>>> It's an awesome summary of what constructs are necessary to be able to define useful views while also keeping them portable.
>>>>>>>
>>>>>>> To be able to express column projections, simple filter application, joins, grouping and common aggregate and scalar functions in a structured IR, how much effort do you think would be required?
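[Editor's sketch: the constructs enumerated above -- projection, filtering, grouping and aggregation -- can be made concrete as a tiny JSON-serializable tree. Every node shape and field name below is invented purely for illustration; none of this is part of any Iceberg proposal or spec.]

```python
import json

# Hypothetical, minimal JSON-serializable IR nodes for the constructs
# listed in the thread (projection, filter, grouping, aggregation).
# All names here are invented for illustration only.
def scan(table):
    return {"op": "scan", "table": table}

def filter_(child, predicate):
    return {"op": "filter", "child": child, "predicate": predicate}

def aggregate(child, group_by, aggs):
    return {"op": "aggregate", "child": child,
            "group-by": group_by, "aggs": aggs}

def project(child, columns):
    return {"op": "project", "child": child, "columns": columns}

# SELECT region, count(*) AS cnt FROM db.orders WHERE status = 'open' GROUP BY region
view_ir = project(
    aggregate(
        filter_(scan("db.orders"),
                {"fn": "eq", "args": [{"col": "status"}, {"lit": "open"}]}),
        group_by=["region"],
        aggs={"cnt": {"fn": "count", "args": []}},
    ),
    columns=["region", "cnt"],
)

# The tree round-trips through JSON, so no engine-specific parsing
# or analysis (scopes, coercions) is needed to consume it.
assert json.loads(json.dumps(view_ir)) == view_ir
print(view_ir["op"])  # project
```

Even this toy version suggests the real cost is not the tree encoding itself but agreeing on function names, type coercions and null semantics across engines.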
>>>>>>> We didn't really talk about the downsides of a structured approach, other than that it looks complex.
>>>>>>> If we indeed estimate it as a multi-year effort, I wouldn't argue for it. Maybe I was overly optimistic though.
>>>>>>>
>>>>>>> As Jack mentioned, for an engine-specific approach that's not supposed to be consumed by multiple engines, we may be better served by an approach that's outside of the Iceberg spec, like https://github.com/trinodb/trino/pull/8540.
>>>>>>>
>>>>>>> Best,
>>>>>>> PF
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 12:33 PM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thank you for all the comments. I will try to address them all here together.
>>>>>>>>
>>>>>>>> - @all Cross-engine compatibility of view definition: Multiple options such as engine-agnostic SQL or an IR of some form have been mentioned. We can all agree that all of these options are non-trivial to design/implement (perhaps a multi-year effort based on the option chosen) and merit further discussion. I would like to suggest that we continue this discussion but target this work for the future (v2?). In v1, we can add an optional dialect field and an optional expanded/resolved SQL field that can be interpreted by engines as they see fit. V1 can unlock many use cases where the views are either accessed by a single engine, or multi-engine use cases where a (common) subset of SQL is supported. This proposal allows for desirable features such as versioning of views and a common format of storing view metadata while allowing extensibility in the future.
>>>>>>>> *Does >>>>>>>> anyone feel strongly otherwise?* >>>>>>>> - @Piotr As for common views at Netflix, the restrictions on >>>>>>>> SQL are not enforced, but are advised as best practices. The >>>>>>>> convention of >>>>>>>> common SQL has been working for a majority of users. SQL features >>>>>>>> commonly >>>>>>>> used are column projections, simple filter application, joins, >>>>>>>> grouping and >>>>>>>> common aggregate and scalar function. A few users occasionally >>>>>>>> would like >>>>>>>> to use Trino or Spark specific functions but are sometimes able to >>>>>>>> find a >>>>>>>> way to use a function that is common to both the engines. >>>>>>>> - @Jacques and @Jack Iceberg data types are engine agnostic and >>>>>>>> hence were picked for storing view schema. Thinking further, the >>>>>>>> schema >>>>>>>> field should be made 'optional', since not all engines require it. >>>>>>>> (e.g. >>>>>>>> Spark does not need it and Trino uses it only for validation). >>>>>>>> - @Jacques Table references in the views can be arbitrary >>>>>>>> objects such as tables from other catalogs or elasticsearch tables >>>>>>>> etc. I >>>>>>>> will clarify it in the spec. >>>>>>>> >>>>>>>> I will work on incorporating all the comments in the spec and make >>>>>>>> the next revision available for review soon. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Anjali. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jul 27, 2021 at 2:51 AM Piotr Findeisen < >>>>>>>> pi...@starburstdata.com> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Thanks Jack and Jacques for sharing your thoughts. >>>>>>>>> >>>>>>>>> I agree that tracking dialect/origin is better than nothing. >>>>>>>>> I think having a Map {dialect: sql} is not going to buy us much. >>>>>>>>> I.e. it would be useful if there was some external app (or a human >>>>>>>>> being) that would write those alternative SQLs for each dialect. 
>>>>>>>>> Otherwise I am not imagining Spark writing SQL for Spark and Trino, or Trino writing SQL for Trino and Spark.
>>>>>>>>>
>>>>>>>>> Thanks Jacques for a good summary of SQL supporting options.
>>>>>>>>> While I like the idea of starting with the Trino SQL ANTLR grammar file (it's really well written and resembles the spec quite well), you made a good point that grammar is only part of the problem. Coercions, function resolution, dereference resolution and table resolution are parts of query analysis that go beyond just grammar.
>>>>>>>>> In fact, column scoping rules -- while clearly defined by the spec -- may easily differ between engines (pretty usual).
>>>>>>>>> That's why I would rather lean towards some intermediate representation that is *not* SQL, doesn't require parsing (is already structural), nor analysis (no scopes! no implicit coercions!).
>>>>>>>>> Before we embark on such a journey, it would be interesting to hear @Martin Traverso <mar...@starburstdata.com> 's thoughts on feasibility though.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>> On Fri, Jul 23, 2021 at 3:27 AM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Some thoughts...
>>>>>>>>>>
>>>>>>>>>> - In general, many engines want (or may require) a resolved sql field. This--at minimum--typically includes star expansion, since traditional view behavior is that stars are expanded at view creation time (since this is the only way to guarantee that the view returns the same logical definition even if the underlying table changes). This may also include replacement of relative object names with absolute object names based on the session catalog & namespace.
If I recall correctly, >>>>>>>>>> Hive does >>>>>>>>>> both of these things. >>>>>>>>>> - It isn't clear in the spec whether the table references >>>>>>>>>> used in views are restricted to other Iceberg objects or can be >>>>>>>>>> arbitrary >>>>>>>>>> objects in the context of a particular engine. Maybe I missed >>>>>>>>>> this? For >>>>>>>>>> example, can I have a Trino engine view that references an >>>>>>>>>> Elasticsearch >>>>>>>>>> table stored in an Iceberg view? >>>>>>>>>> - Restricting schemas to the Iceberg types will likely lead >>>>>>>>>> to unintended consequences. I appreciate the attraction to it but >>>>>>>>>> I think >>>>>>>>>> it would either create artificial barriers around the types of >>>>>>>>>> SQL that are >>>>>>>>>> allowed and/or mean that replacing a CTE with a view could >>>>>>>>>> potentially >>>>>>>>>> change the behavior of the query which I believe violates most >>>>>>>>>> typical >>>>>>>>>> engine behaviors. A good example of this is the simple sql >>>>>>>>>> statement of >>>>>>>>>> "SELECT c1, 'foo' as c2 from table1". In many engines (and >>>>>>>>>> Calcite by >>>>>>>>>> default I believe), c2 will be specified as a CHAR(3). In this >>>>>>>>>> Iceberg >>>>>>>>>> context, is this view disallowed? If it isn't disallowed then you >>>>>>>>>> have an >>>>>>>>>> issue where the view schema will be required to be different from >>>>>>>>>> a CTE >>>>>>>>>> since the engine will resolve it differently than Iceberg. Even >>>>>>>>>> if you >>>>>>>>>> ignore CHAR(X), you've still got VARCHAR(X) to contend with... >>>>>>>>>> - It is important to remember that Calcite is a set of >>>>>>>>>> libraries and not a specification. There are things that can be >>>>>>>>>> specified >>>>>>>>>> in Calcite but in general it doesn't have formal specification as >>>>>>>>>> a first >>>>>>>>>> principle. It is more implementation as a first principle. 
This >>>>>>>>>> is in >>>>>>>>>> contrast to projects like Arrow and Iceberg, which start with >>>>>>>>>> well-formed >>>>>>>>>> specifications. I've been working with Calcite since before it >>>>>>>>>> was an >>>>>>>>>> Apache project and I wouldn't recommend adopting it as any form >>>>>>>>>> of a >>>>>>>>>> specification. On the flipside, I am very supportive of using it >>>>>>>>>> for a >>>>>>>>>> reference implementation standard for Iceberg view consumption, >>>>>>>>>> manipulation, etc. If anything, I'd suggest we start with the >>>>>>>>>> adoption of >>>>>>>>>> a relatively clear grammar, e.g. the Antlr grammar file that >>>>>>>>>> Spark [1] >>>>>>>>>> and/or Trino [2] use. Even that is not a complete specification >>>>>>>>>> as grammar >>>>>>>>>> must still be interpreted with regards to type promotion, function >>>>>>>>>> resolution, consistent unnamed expression naming, etc that aren't >>>>>>>>>> defined >>>>>>>>>> at the grammar level. I'd definitely avoid using Calcite's JavaCC >>>>>>>>>> grammar >>>>>>>>>> as it heavily embeds implementation details (in a good way) and >>>>>>>>>> relies on >>>>>>>>>> some fairly complex logic in the validator and sql2rel components >>>>>>>>>> to be >>>>>>>>>> fully resolved/comprehended. >>>>>>>>>> >>>>>>>>>> Given the above, I suggest having a field which describes the >>>>>>>>>> dialect(origin?) of the view and then each engine can decide how >>>>>>>>>> they want >>>>>>>>>> to consume/mutate that view (and whether they want to or not). It >>>>>>>>>> does risk >>>>>>>>>> being a dumping ground. Nonetheless, I'd expect the alternative of >>>>>>>>>> establishing a formal SQL specification to be a similarly long >>>>>>>>>> process to >>>>>>>>>> the couple of years it took to build the Arrow and Iceberg >>>>>>>>>> specifications. >>>>>>>>>> (Realistically, there is far more to specify here than there is in >>>>>>>>>> either >>>>>>>>>> of those two domains.) 
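[Editor's sketch: to make the dialect (origin) field suggested above concrete, here is one way a view version record carrying such a field might look. Every field name and value in this sketch is an assumption made for illustration; the authoritative layout is in Anjali's spec document.]

```python
import json

# Illustrative only: a view version record whose SQL representation is
# tagged with the engine dialect that produced it. All field names here
# ("version-id", "representations", "dialect", ...) are invented for
# this sketch, not taken from the spec document.
view_version = {
    "version-id": 1,
    "timestamp-ms": 1626818400000,
    "summary": {"operation": "create"},
    "representations": [
        {
            "type": "sql",
            "sql": "SELECT c1, 'foo' AS c2 FROM db.table1",
            "dialect": "spark-3.1",  # origin engine, per the proposal
        }
    ],
}

print(json.dumps(view_version, indent=2))
```

A consuming engine could check the dialect field and decide whether to run the SQL as-is, translate it, or refuse the view.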
>>>>>>>>>>
>>>>>>>>>> Some other notes:
>>>>>>>>>>
>>>>>>>>>> - Calcite does provide a nice reference document [3] but it is not sufficient to implement what is necessary for parsing/validating/resolving a SQL string correctly/consistently.
>>>>>>>>>> - Projects like Coral [4] are interesting here, but even Coral is based roughly on "HiveQL", which also doesn't have a formal specification process outside of the Hive version you're running. See this thread in the Coral slack [5].
>>>>>>>>>> - ZetaSQL [6] also seems interesting in this space. It feels closer to specification-based [7] than Calcite but is much less popular in the big data domain. I also haven't reviewed its SQL completeness closely, a strength of Calcite.
>>>>>>>>>> - One of the other problems with building against an implementation as opposed to a specification (e.g. Calcite) is that it can make it difficult or near impossible to implement the same algorithms again without a bunch of reverse engineering. If interested in an example of this, see the discussion behind the LZ4 deprecation on the Parquet spec [8] for how painful this kind of mistake can become.
>>>>>>>>>> - I'd love to use the SQL specification itself, but nobody actually implements it in its entirety and it has far too many places where things are "implementation-defined" [9].
>>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 >>>>>>>>>> [2] >>>>>>>>>> https://github.com/trinodb/trino/blob/master/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4 >>>>>>>>>> [3] https://calcite.apache.org/docs/reference.html >>>>>>>>>> [4] https://github.com/linkedin/coral >>>>>>>>>> [5] >>>>>>>>>> https://coral-sql.slack.com/archives/C01FHBJR20Y/p1608064004073000 >>>>>>>>>> [6] https://github.com/google/zetasql >>>>>>>>>> [7] >>>>>>>>>> https://github.com/google/zetasql/blob/master/docs/one-pager.md >>>>>>>>>> [8] >>>>>>>>>> https://github.com/apache/parquet-format/blob/master/Compression.md#lz4 >>>>>>>>>> [9] https://twitter.com/sc13ts/status/1413728808830525440 >>>>>>>>>> >>>>>>>>>> On Thu, Jul 22, 2021 at 12:01 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Did not notice that we are also discussing cross-engine >>>>>>>>>>> interoperability here, I will add my response in the design doc >>>>>>>>>>> here. >>>>>>>>>>> >>>>>>>>>>> I would personally prefer cross-engine interoperability as a >>>>>>>>>>> goal and get the spec in the right structure in the initial release, >>>>>>>>>>> because: >>>>>>>>>>> >>>>>>>>>>> 1. I believe that cross-engine compatibility is a critical >>>>>>>>>>> feature of Iceberg. If I am a user of an existing data lake that >>>>>>>>>>> already >>>>>>>>>>> supports views (e.g. Hive), I don't even need Iceberg to have this >>>>>>>>>>> view >>>>>>>>>>> feature. I can do what is now done for Trino to use views with >>>>>>>>>>> Iceberg. I >>>>>>>>>>> can also just use a table property to indicate the table is a view >>>>>>>>>>> and >>>>>>>>>>> store the view SQL as a table property and do my own thing in any >>>>>>>>>>> query >>>>>>>>>>> engine to support all the view features. 
One of the most valuable and unique features that an Iceberg view can unlock is to allow a view to be created in one engine and read by another. Not supporting cross-engine compatibility feels like losing a lot of value to me.
>>>>>>>>>>>
>>>>>>>>>>> 2. In the view definition, it feels inconsistent to me that we have "schema" as an Iceberg-native schema, but the "sql" field as view SQL that can come from any query engine. If the engine already needs to convert the view schema to an Iceberg schema, it should just do the same for the view SQL.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the way to achieve it, I think it comes down to either Apache Calcite (or some other third-party alternative I don't know of) or our own implementation of some intermediate representation. I don't have a very strong opinion, but my thoughts are the following:
>>>>>>>>>>>
>>>>>>>>>>> 1. Calcite is supposed to be the go-to software to deal with this kind of issue, but my personal concern is that the integration is definitely going to be much more involved, and it will become another barrier for newer engines to onboard, because an engine not only needs to implement the Iceberg APIs but also needs Calcite support. It will also start to become a constant discussion around what we maintain and what we should push to Calcite, similar to our situation today with Spark.
>>>>>>>>>>>
>>>>>>>>>>> 2. Another way I am leaning towards, as Piotr also suggested, is to have a native lightweight logical query structure representation of the view SQL and store that instead of the SQL string.
We already deal with Expressions in Iceberg, and engines have to convert predicates to Iceberg expressions for predicate pushdown, so I think it would not be hard to extend that to support this use case. Different engines can build this logical structure when traversing their own AST during a create-view query.
>>>>>>>>>>>
>>>>>>>>>>> 3. With these considerations, I think the "sql" field can potentially be a map (maybe called "engine-sqls"?), where the key is the engine type and version, like "Spark 3.1", and the value is the view SQL string. In this way, the engine that creates the view can still read the SQL directly, which might lead to better engine-native integration and avoid redundant parsing. But in this approach there is always a default intermediate representation to fall back to when the engine's key is not found in the map. If we want to make incremental progress and delay the design of the intermediate representation, I think we should at least use this map instead of just a single string.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 22, 2021 at 6:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> First of all, thank you for this discussion and all the view-related work!
>>>>>>>>>>>>
>>>>>>>>>>>> I agree that solving the cross-engine compatibility problem may not be a primary feature today, but I am concerned that not thinking about this from the start may "tunnel" us into a wrong direction.
>>>>>>>>>>>> Cross-engine compatible views would be such a cool feature that it is hard to just let it pass.
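[Editor's sketch: Jack's "engine-sqls" map with a fallback representation, as described earlier in the thread, could behave roughly like this. The key format and the "default" key are assumptions made purely for illustration, not part of any proposal.]

```python
# Sketch of the proposed "engine-sqls" map: view SQL keyed by engine
# type and version, plus a shared fallback representation. The key
# format and the "default" convention are illustrative assumptions.
engine_sqls = {
    "Spark 3.1": "SELECT id, ts FROM db.events WHERE ts > date_sub(current_date(), 7)",
    "Trino 358": "SELECT id, ts FROM db.events WHERE ts > current_date - INTERVAL '7' DAY",
    "default": "SELECT id, ts FROM db.events WHERE ts > CURRENT_DATE - INTERVAL '7' DAY",
}

def sql_for(engine_key):
    # Prefer the engine's own SQL to avoid redundant parsing; otherwise
    # fall back to the default (eventually, an intermediate
    # representation could live in that slot instead of SQL).
    return engine_sqls.get(engine_key, engine_sqls["default"])

assert sql_for("Spark 3.1") == engine_sqls["Spark 3.1"]
assert sql_for("Flink 1.13") == engine_sqls["default"]  # unknown engine falls back
```

The engine that created the view reads its own entry verbatim; any other engine hits the fallback, which is where a future IR could slot in without breaking the map layout.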
>>>>>>>>>>>>
>>>>>>>>>>>> My thinking about a smaller IR may be a side-product of me not being familiar enough with Calcite.
>>>>>>>>>>>> However, the new IR being focused on a compatible representation and not being tied to anything existing are actually good things.
>>>>>>>>>>>> For example, we need to focus on the JSON representation, but we don't need to deal with tree traversal or anything, so the code for this could be pretty simple.
>>>>>>>>>>>>
>>>>>>>>>>>> > Allow only ANSI-compliant SQL and anything that is truly common across engines in the view definition (this is how currently Netflix uses these 'common' views across Spark and Trino)
>>>>>>>>>>>>
>>>>>>>>>>>> It's interesting. Anjali, do you have means to enforce that, or is this just a convention?
>>>>>>>>>>>>
>>>>>>>>>>>> What are the common building blocks (relational operations, constructs and functions) that you found sufficient for expressing your views?
>>>>>>>>>>>> Being able to enumerate them could help validate the various approaches considered here, including the feasibility of a dedicated representation.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> PF
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 22, 2021 at 2:28 PM Ryan Murray <rym...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Anjali,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am definitely happy to help with implementing 1-3 in your first list once the spec has been approved by the community. My hope is that the final version of the view spec will make it easy to re-use existing rollback/time travel/metadata etc. functionalities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding SQL dialects, my personal opinion is: enforcing ANSI-compliant SQL across all engines is hard and probably not desirable, while storing Calcite ASTs makes it hard for e.g. Python to use views. A project to make a cross-language and cross-engine IR for SQL views and the relevant transpilers is imho outside the scope of this spec and probably deserving of an Apache project of its own. A smaller IR like Piotr suggested is possible, but I feel it will likely quickly snowball into a larger project and slow down adoption of the view spec in Iceberg. So I think the most reasonable way forward is to add a dialect field and a warning to engines that views are not (yet) cross-compatible. This is at odds with the original spirit of Iceberg tables and I wonder how the broader community feels about it? I would hope that we can make the view spec engine-free over time and eventually deprecate the dialect field.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS if anyone is interested in collaborating on engine-agnostic views, please reach out. I am keen on exploring this topic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 10:51 PM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Ryan (M), Piotr and Vivekanand for the comments. I have and will continue to address them in the doc.
>>>>>>>>>>>>>> Great to know about Trino views, Piotr!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks to everybody who has offered help with implementation.
>>>>>>>>>>>>>> The spec as it is proposed in the doc has been implemented and >>>>>>>>>>>>>> is in use at >>>>>>>>>>>>>> Netflix (currently on Iceberg 0.9). Once we close the spec, we >>>>>>>>>>>>>> will rebase >>>>>>>>>>>>>> our code to Iceberg-0.12 and incorporate changes to format and >>>>>>>>>>>>>> other feedback from the community and should be able to make >>>>>>>>>>>>>> this MVP >>>>>>>>>>>>>> implementation available quickly as a PR. >>>>>>>>>>>>>> >>>>>>>>>>>>>> A few areas that we have not yet worked on and would love for >>>>>>>>>>>>>> the community to help are: >>>>>>>>>>>>>> 1. Time travel on views: Be able to access the view as of a >>>>>>>>>>>>>> version or time >>>>>>>>>>>>>> 2. History table: A system table implementation for $versions >>>>>>>>>>>>>> similar to the $snapshots table in order to display the history >>>>>>>>>>>>>> of a view >>>>>>>>>>>>>> 3. Rollback to a version: A way to rollback a view to a >>>>>>>>>>>>>> previous version >>>>>>>>>>>>>> 4. Engine agnostic SQL: more below. >>>>>>>>>>>>>> >>>>>>>>>>>>>> One comment that is worth a broader discussion is the dialect >>>>>>>>>>>>>> of the SQL stored in the view metadata. The purpose of the spec >>>>>>>>>>>>>> is to >>>>>>>>>>>>>> provide a storage format for view metadata and APIs to access >>>>>>>>>>>>>> that >>>>>>>>>>>>>> metadata. The dialect of the SQL stored is an orthogonal >>>>>>>>>>>>>> question and is >>>>>>>>>>>>>> outside the scope of this spec. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Nonetheless, it is an important concern, so compiling a few >>>>>>>>>>>>>> suggestions that came up in the comments to continue the >>>>>>>>>>>>>> discussion: >>>>>>>>>>>>>> 1. Allow only ANSI-compliant SQL and anything that is truly >>>>>>>>>>>>>> common across engines in the view definition (this is how >>>>>>>>>>>>>> currently Netflix >>>>>>>>>>>>>> uses these 'common' views across Spark and Trino) >>>>>>>>>>>>>> 2. Add a field to the view metadata to identify the dialect >>>>>>>>>>>>>> of the SQL. 
This allows for any desired dialect, but no improved >>>>>>>>>>>>>> cross-engine operability >>>>>>>>>>>>>> 3. Store AST produced by Calcite in the view metadata and >>>>>>>>>>>>>> translate back and forth between engine-supported SQL and AST >>>>>>>>>>>>>> 4. Intermediate structured language of our own. (What >>>>>>>>>>>>>> additional functionality does it provide over Calcite?) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Given that the view metadata is json, it is easily extendable >>>>>>>>>>>>>> to incorporate any new fields needed to make the SQL truly >>>>>>>>>>>>>> compatible >>>>>>>>>>>>>> across engines. >>>>>>>>>>>>>> >>>>>>>>>>>>>> What do you think? >>>>>>>>>>>>>> >>>>>>>>>>>>>> regards, >>>>>>>>>>>>>> Anjali >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 3:09 AM Piotr Findeisen < >>>>>>>>>>>>>> pi...@starburstdata.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> FWIW, in Trino we just added Trino views support. >>>>>>>>>>>>>>> https://github.com/trinodb/trino/pull/8540 >>>>>>>>>>>>>>> Of course, this is by no means usable by other query engines. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Anjali, your document does not talk much about compatibility >>>>>>>>>>>>>>> between query engines. >>>>>>>>>>>>>>> How do you plan to address that? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For example, I am familiar with Coral, and I appreciate its >>>>>>>>>>>>>>> powers for dealing with legacy stuff like views defined by Hive. >>>>>>>>>>>>>>> I treat it as a great technology supporting transitioning >>>>>>>>>>>>>>> from a query engine to a better one. >>>>>>>>>>>>>>> However, I would not base a design of some new system for >>>>>>>>>>>>>>> storing cross-engine compatible views on that. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Is there something else we can use? >>>>>>>>>>>>>>> Maybe the view definition should use some >>>>>>>>>>>>>>> intermediate structured language that's not SQL? 
>>>>>>>>>>>>>>> For example, it could represent the logical structure of operations in a semantic manner.
>>>>>>>>>>>>>>> This would eliminate the need for cross-engine compatible parsing and analysis.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>> PF
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 11:04 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Anjali!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have left some comments on the document. I unfortunately have to miss the community meetup tomorrow but would love to chat more/help w/ implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 20, 2021 at 7:42 AM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> John Zhuge and I would like to propose the following spec for storing view metadata in Iceberg. The proposal has been implemented [1] and has been in production at Netflix for over 15 months.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please let us know your thoughts by adding comments to the doc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Anjali.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>