+1 on this, should we add an IR field / type to View <https://iceberg.apache.org/view-spec/> spec <https://iceberg.apache.org/view-spec/> then? Sorry if its discussed already, catching up a bit on the context.
Thanks Szehon On Fri, Oct 25, 2024 at 11:51 AM rdb...@gmail.com <rdb...@gmail.com> wrote: > Substrait is one of the reasons why we designed views with the ability to > have different representations. I think that SQL translation is not a great > solution. I'd like to see more focus on a portable intermediate > representation like Substrait. That would solve a lot of the limitations > with the SQL representation (like not wanting to rewrite after resolution). > > On Fri, Oct 25, 2024 at 10:21 AM Szehon Ho <szehon.apa...@gmail.com> > wrote: > >> Im dont have hands on experience on Substrait, but wondering, is >> substrait representation possible today with existing Iceberg view spec? >> Ie, engines can store today the text serialized substrait representation >> with sql dialect 'substrait'? Or is it an abuse of spec and we should >> make a proper field for IR representations? >> >> I imagine if its possible, interested engines can start working on >> translating to and from substrait, and hopefully it will catch on, but as >> of now Im not sure that Iceberg even provides a standard option? >> >> Thoughts? >> Szehon >> >> On Tue, Oct 22, 2024 at 4:05 AM Will Raschkowski >> <wraschkow...@palantir.com.invalid> wrote: >> >>> It’s all nascent, but we have a relatively solid experience going from >>> Spark to Substrait [1], Substrait to Calcite [2], and Calcite to one of its >>> supported SQL dialects [3]. All these translations are Java-based. >>> >>> >>> [1]: https://github.com/substrait-io/substrait-java/pull/271 >>> >>> [2]: >>> https://github.com/substrait-io/substrait-java/blob/main/isthmus/README.md >>> >>> [3]: >>> https://svn.apache.org/repos/asf/calcite/site/apidocs/org/apache/calcite/sql/dialect/package-summary.html >>> >>> >>> >>> *From: *Ajantha Bhat <ajanthab...@gmail.com> >>> *Date: *Tuesday, 22 October 2024 at 08:22 >>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org> >>> *Subject: *Re: [Discuss] Iceberg View Interoperability >>> >>> *CAUTION:* This email originates from an external party (outside of >>> Palantir). If you believe this message is suspicious in nature, please use >>> the "Report Message" button built into Outlook. >>> >>> >>> >>> Thanks Dan for the reply. >>> >>> but I'm not convinced using a procedure is a good idea or really moves >>> things forward in that direction. >>> I just feel like the procedure approach has a number of drawbacks >>> including, relying on the user to do the translation, being dependent on >>> Spark for defining the other representations (the API would typically be >>> available to you if you're using spark), there are a number of >>> quoting/escaping issues that make it awkward to define, and it introduces a >>> second way to define a view. >>> >>> >>> I didn't consider this as a problem as the procedure is not exposing any >>> new functionalities, users can still achieve this using java API and end up >>> with the same situations mentioned above. Also, not just Spark, each engine >>> can have this procedure. >>> But I see what you mean, having a centralized solution independent of >>> the engine can be better here. >>> >>> I feel like PyIceberg is the right path forward here. >>> >>> >>> >>> If we are using the python based solutions like SQLGlot, it makes sense >>> to use the PyIceberg to expose a view API to translate and add additional >>> dialects from it. But SQLGlot also doesn't support every engine out there >>> (Like Flink, Dremio, Impala etc). So, we may still need an API to manually >>> add the dialect for unsupported engines from python. >>> >>> I am sure we have people here who work on Coral, SQLGlot, Substrait and >>> similar technologies in this mailing list. I will wait for a week to see if >>> there are any opinions on the same. If not, we can start the development >>> with SQLGlot. >>> >>> - Ajantha >>> >>> >>> >>> On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks <dwe...@apache.org> wrote: >>> >>> Hey Ajantha, >>> >>> >>> >>> I think it's good to figure out a path forward for extending view >>> support, but I'm not convinced using a procedure is a good idea or really >>> moves things forward in that direction. >>> >>> >>> >>> As you already indicated, there are a number of different libraries to >>> translate views, but of the various options, It feels like there are two >>> approaches that resonate with me are either letting the engine perform the >>> translation (like Trino has done with Coral) or supporting views in >>> python [github.com] >>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg-python/issues/818__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JObc743Gc$> >>> so that you can use a combination of libraries like SQLGlot. Python is the >>> most universally available and is already used in a lot of tooling for >>> running/managing DDL statements. >>> >>> >>> >>> I just feel like the procedure approach has a number of drawbacks >>> including, relying on the user to do the translation, being dependent on >>> Spark for defining the other representations (the API would typically be >>> available to you if you're using spark), there are a number of >>> quoting/escaping issues that make it awkward to define, and it introduces a >>> second way to define a view. >>> >>> >>> >>> Long-term IR (e.g. substrate) is the ideal, but I have questions about >>> whether we'll be able to achieve enough consensus to gain wide adoption. I >>> feel like PyIceberg is the right path forward here. >>> >>> >>> >>> -Dan >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat <ajanthab...@gmail.com> >>> wrote: >>> >>> Hi everyone, >>> >>> It’s been over six months since Iceberg views were introduced (in >>> version 1.5.0), and while we’ve seen some benefits—such as versioning and >>> cross-engine recognition—there’s still a critical gap in terms of true >>> interoperability. Although views created by one engine are recognized by >>> another, they currently cannot be queried across engines as one engine >>> cannot understand the SQL dialect of another engine. >>> >>> To address this, I propose a two-step approach: >>> >>> >>> *Step 1: A temporary solution to manually add SQL dialects. *It is >>> already possible to do this using some java code but most of the view users >>> interact via SQL, there’s currently no SQL-native method to manually add >>> dialects from other engines into Iceberg views. I suggest adding a call >>> procedure to enable this. I've recently worked on this and have submitted >>> the following PRs for review: >>> >>> - Spark-3.5: Refactor BaseProcedure to support views [#11326 >>> [github.com] >>> >>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11326__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOcbrMwYs$> >>> ] >>> - Spark-3.5: Procedure to add view dialect [#11327 [github.com] >>> >>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11327__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOdRsXqDS$> >>> ] >>> >>> With these changes, Iceberg views could be queried across different >>> engines using their respective dialects. >>> Please review and provide feedback. >>> >>> >>> *Step 2: A long-term solution for full view interoperability. *For a >>> sustainable solution, we need to consider both open-source and commercial >>> engines (e.g., Snowflake, Dremio, BigQuery). A framework that can >>> automatically translate SQL dialects from one engine to another seems ideal. >>> >>> I’ve explored Coral*[1]*, which works well with open-source engines >>> like Hive, Trino, and Spark, but doesn’t fully cover commercial engines. In >>> contrast, SQLGlot*[2]* seems to be a better fit, as it supports 23 SQL >>> dialects and would provide a more robust solution for interoperability. But >>> SQLGlot is a python project. So, we need REST or Py4J to work with SQLGlot. >>> >>> Additionally, I looked into Substrait*[3]* as a potential intermediate >>> representation for views, but as per my understanding, integration with >>> commercial engines isn’t quite ready yet. This technology has broader >>> applications beyond views, including table planning, and may be worth >>> revisiting in the future. >>> >>> Given this, I believe that incorporating a translation layer with >>> SQLGlot is the right direction. If no one is currently working on view >>> interoperability, I’d be happy to explore this further and take the lead. I >>> welcome any thoughts or feedback from those involved in related projects. >>> >>> Looking forward to hearing your opinions. >>> >>> [1] https://github.com/linkedin/coral [github.com] >>> <https://urldefense.com/v3/__https:/github.com/linkedin/coral__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOaU9qn72$> >>> [2] https://sqlglot.com/sqlglot.html [sqlglot.com] >>> <https://urldefense.com/v3/__https:/sqlglot.com/sqlglot.html__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOSbK5Ggx$> >>> [3] https://substrait.io/ [substrait.io] >>> <https://urldefense.com/v3/__https:/substrait.io/__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JORcivRrB$> >>> >>> - Ajantha >>> >>>