Substrait is one of the reasons why we designed views with the ability to have different representations. I think that SQL translation is not a great solution. I'd like to see more focus on a portable intermediate representation like Substrait. That would solve a lot of the limitations with the SQL representation (like not wanting to rewrite after resolution).
On Fri, Oct 25, 2024 at 10:21 AM Szehon Ho <szehon.apa...@gmail.com> wrote: > Im dont have hands on experience on Substrait, but wondering, is substrait > representation possible today with existing Iceberg view spec? Ie, engines > can store today the text serialized substrait representation with sql > dialect 'substrait'? Or is it an abuse of spec and we should make a > proper field for IR representations? > > I imagine if its possible, interested engines can start working on > translating to and from substrait, and hopefully it will catch on, but as > of now Im not sure that Iceberg even provides a standard option? > > Thoughts? > Szehon > > On Tue, Oct 22, 2024 at 4:05 AM Will Raschkowski > <wraschkow...@palantir.com.invalid> wrote: > >> It’s all nascent, but we have a relatively solid experience going from >> Spark to Substrait [1], Substrait to Calcite [2], and Calcite to one of its >> supported SQL dialects [3]. All these translations are Java-based. >> >> >> [1]: https://github.com/substrait-io/substrait-java/pull/271 >> >> [2]: >> https://github.com/substrait-io/substrait-java/blob/main/isthmus/README.md >> >> [3]: >> https://svn.apache.org/repos/asf/calcite/site/apidocs/org/apache/calcite/sql/dialect/package-summary.html >> >> >> >> *From: *Ajantha Bhat <ajanthab...@gmail.com> >> *Date: *Tuesday, 22 October 2024 at 08:22 >> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org> >> *Subject: *Re: [Discuss] Iceberg View Interoperability >> >> *CAUTION:* This email originates from an external party (outside of >> Palantir). If you believe this message is suspicious in nature, please use >> the "Report Message" button built into Outlook. >> >> >> >> Thanks Dan for the reply. >> >> but I'm not convinced using a procedure is a good idea or really moves >> things forward in that direction. >> I just feel like the procedure approach has a number of drawbacks >> including, relying on the user to do the translation, being dependent on >> Spark for defining the other representations (the API would typically be >> available to you if you're using spark), there are a number of >> quoting/escaping issues that make it awkward to define, and it introduces a >> second way to define a view. >> >> >> I didn't consider this as a problem as the procedure is not exposing any >> new functionalities, users can still achieve this using java API and end up >> with the same situations mentioned above. Also, not just Spark, each engine >> can have this procedure. >> But I see what you mean, having a centralized solution independent of the >> engine can be better here. >> >> I feel like PyIceberg is the right path forward here. >> >> >> >> If we are using the python based solutions like SQLGlot, it makes sense >> to use the PyIceberg to expose a view API to translate and add additional >> dialects from it. But SQLGlot also doesn't support every engine out there >> (Like Flink, Dremio, Impala etc). So, we may still need an API to manually >> add the dialect for unsupported engines from python. >> >> I am sure we have people here who work on Coral, SQLGlot, Substrait and >> similar technologies in this mailing list. I will wait for a week to see if >> there are any opinions on the same. If not, we can start the development >> with SQLGlot. >> >> - Ajantha >> >> >> >> On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks <dwe...@apache.org> wrote: >> >> Hey Ajantha, >> >> >> >> I think it's good to figure out a path forward for extending view >> support, but I'm not convinced using a procedure is a good idea or really >> moves things forward in that direction. >> >> >> >> As you already indicated, there are a number of different libraries to >> translate views, but of the various options, It feels like there are two >> approaches that resonate with me are either letting the engine perform the >> translation (like Trino has done with Coral) or supporting views in >> python [github.com] >> <https://urldefense.com/v3/__https:/github.com/apache/iceberg-python/issues/818__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JObc743Gc$> >> so that you can use a combination of libraries like SQLGlot. Python is the >> most universally available and is already used in a lot of tooling for >> running/managing DDL statements. >> >> >> >> I just feel like the procedure approach has a number of drawbacks >> including, relying on the user to do the translation, being dependent on >> Spark for defining the other representations (the API would typically be >> available to you if you're using spark), there are a number of >> quoting/escaping issues that make it awkward to define, and it introduces a >> second way to define a view. >> >> >> >> Long-term IR (e.g. substrate) is the ideal, but I have questions about >> whether we'll be able to achieve enough consensus to gain wide adoption. I >> feel like PyIceberg is the right path forward here. >> >> >> >> -Dan >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat <ajanthab...@gmail.com> >> wrote: >> >> Hi everyone, >> >> It’s been over six months since Iceberg views were introduced (in version >> 1.5.0), and while we’ve seen some benefits—such as versioning and >> cross-engine recognition—there’s still a critical gap in terms of true >> interoperability. Although views created by one engine are recognized by >> another, they currently cannot be queried across engines as one engine >> cannot understand the SQL dialect of another engine. >> >> To address this, I propose a two-step approach: >> >> >> *Step 1: A temporary solution to manually add SQL dialects. *It is >> already possible to do this using some java code but most of the view users >> interact via SQL, there’s currently no SQL-native method to manually add >> dialects from other engines into Iceberg views. I suggest adding a call >> procedure to enable this. I've recently worked on this and have submitted >> the following PRs for review: >> >> - Spark-3.5: Refactor BaseProcedure to support views [#11326 >> [github.com] >> >> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11326__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOcbrMwYs$> >> ] >> - Spark-3.5: Procedure to add view dialect [#11327 [github.com] >> >> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11327__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOdRsXqDS$> >> ] >> >> With these changes, Iceberg views could be queried across different >> engines using their respective dialects. >> Please review and provide feedback. >> >> >> *Step 2: A long-term solution for full view interoperability. *For a >> sustainable solution, we need to consider both open-source and commercial >> engines (e.g., Snowflake, Dremio, BigQuery). A framework that can >> automatically translate SQL dialects from one engine to another seems ideal. >> >> I’ve explored Coral*[1]*, which works well with open-source engines like >> Hive, Trino, and Spark, but doesn’t fully cover commercial engines. In >> contrast, SQLGlot*[2]* seems to be a better fit, as it supports 23 SQL >> dialects and would provide a more robust solution for interoperability. But >> SQLGlot is a python project. So, we need REST or Py4J to work with SQLGlot. >> >> Additionally, I looked into Substrait*[3]* as a potential intermediate >> representation for views, but as per my understanding, integration with >> commercial engines isn’t quite ready yet. This technology has broader >> applications beyond views, including table planning, and may be worth >> revisiting in the future. >> >> Given this, I believe that incorporating a translation layer with SQLGlot >> is the right direction. If no one is currently working on view >> interoperability, I’d be happy to explore this further and take the lead. I >> welcome any thoughts or feedback from those involved in related projects. >> >> Looking forward to hearing your opinions. >> >> [1] https://github.com/linkedin/coral [github.com] >> <https://urldefense.com/v3/__https:/github.com/linkedin/coral__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOaU9qn72$> >> [2] https://sqlglot.com/sqlglot.html [sqlglot.com] >> <https://urldefense.com/v3/__https:/sqlglot.com/sqlglot.html__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOSbK5Ggx$> >> [3] https://substrait.io/ [substrait.io] >> <https://urldefense.com/v3/__https:/substrait.io/__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JORcivRrB$> >> >> - Ajantha >> >>