Re: [Discuss] Iceberg View Interoperability

Szehon Ho Fri, 25 Oct 2024 10:21:17 -0700

Im dont have hands on experience on Substrait, but wondering, is substrait
representation possible today with existing Iceberg view spec?  Ie, engines
can store today the text serialized substrait representation with sql
dialect 'substrait'?   Or is it an abuse of spec and we should make a
proper field for IR representations?


I imagine if its possible, interested engines can start working on
translating to and from substrait, and hopefully it will catch on, but as
of now Im not sure that Iceberg even provides a standard option?

Thoughts?
Szehon

On Tue, Oct 22, 2024 at 4:05 AM Will Raschkowski
<[email protected]> wrote:

> It’s all nascent, but we have a relatively solid experience going from
> Spark to Substrait [1], Substrait to Calcite [2], and Calcite to one of its
> supported SQL dialects [3]. All these translations are Java-based.
>
>
> [1]: https://github.com/substrait-io/substrait-java/pull/271
>
> [2]:
> https://github.com/substrait-io/substrait-java/blob/main/isthmus/README.md
>
> [3]:
> https://svn.apache.org/repos/asf/calcite/site/apidocs/org/apache/calcite/sql/dialect/package-summary.html
>
>
>
> *From: *Ajantha Bhat <[email protected]>
> *Date: *Tuesday, 22 October 2024 at 08:22
> *To: *[email protected] <[email protected]>
> *Subject: *Re: [Discuss] Iceberg View Interoperability
>
> *CAUTION:* This email originates from an external party (outside of
> Palantir). If you believe this message is suspicious in nature, please use
> the "Report Message" button built into Outlook.
>
>
>
> Thanks Dan for the reply.
>
> but I'm not convinced using a procedure is a good idea or really moves
> things forward in that direction.
> I just feel like the procedure approach has a number of drawbacks
> including, relying on the user to do the translation, being dependent on
> Spark for defining the other representations (the API would typically be
> available to you if you're using spark), there are a number of
> quoting/escaping issues that make it awkward to define, and it introduces a
> second way to define a view.
>
>
> I didn't consider this as a problem as the procedure is not exposing any
> new functionalities, users can still achieve this using java API and end up
> with the same situations mentioned above. Also, not just Spark, each engine
> can have this procedure.
> But I see what you mean, having a centralized solution independent of the
> engine can be better here.
>
> I feel like PyIceberg is the right path forward here.
>
>
>
> If we are using the python based solutions like SQLGlot, it makes sense to
> use the PyIceberg to expose a view API to translate and add additional
> dialects from it. But SQLGlot also doesn't support every engine out there
> (Like Flink, Dremio, Impala etc). So, we may still need an API to manually
> add the dialect for unsupported engines from python.
>
> I am sure we have people here who work on Coral, SQLGlot, Substrait and
> similar technologies in this mailing list. I will wait for a week to see if
> there are any opinions on the same. If not, we can start the development
> with SQLGlot.
>
> - Ajantha
>
>
>
> On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks <[email protected]> wrote:
>
> Hey Ajantha,
>
>
>
> I think it's good to figure out a path forward for extending view support,
> but I'm not convinced using a procedure is a good idea or really moves
> things forward in that direction.
>
>
>
> As you already indicated, there are a number of different libraries to
> translate views, but of the various options, It feels like there are two
> approaches that resonate with me are either letting the engine perform the
> translation (like Trino has done with Coral) or supporting views in
> python [github.com]
> <https://urldefense.com/v3/__https:/github.com/apache/iceberg-python/issues/818__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JObc743Gc$>
> so that you can use a combination of libraries like SQLGlot.  Python is the
> most universally available and is already used in a lot of tooling for
> running/managing DDL statements.
>
>
>
> I just feel like the procedure approach has a number of drawbacks
> including, relying on the user to do the translation, being dependent on
> Spark for defining the other representations (the API would typically be
> available to you if you're using spark), there are a number of
> quoting/escaping issues that make it awkward to define, and it introduces a
> second way to define a view.
>
>
>
> Long-term IR (e.g. substrate) is the ideal, but I have questions about
> whether we'll be able to achieve enough consensus to gain wide adoption.  I
> feel like PyIceberg is the right path forward here.
>
>
>
> -Dan
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat <[email protected]>
> wrote:
>
> Hi everyone,
>
> It’s been over six months since Iceberg views were introduced (in version
> 1.5.0), and while we’ve seen some benefits—such as versioning and
> cross-engine recognition—there’s still a critical gap in terms of true
> interoperability. Although views created by one engine are recognized by
> another, they currently cannot be queried across engines as one engine
> cannot understand the SQL dialect of another engine.
>
> To address this, I propose a two-step approach:
>
>
> *Step 1: A temporary solution to manually add SQL dialects. *It is
> already possible to do this using some java code but most of the view users
> interact via SQL, there’s currently no SQL-native method to manually add
> dialects from other engines into Iceberg views. I suggest adding a call
> procedure to enable this. I've recently worked on this and have submitted
> the following PRs for review:
>
>    - Spark-3.5: Refactor BaseProcedure to support views [#11326
>    [github.com]
>    
> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11326__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOcbrMwYs$>
>    ]
>    - Spark-3.5: Procedure to add view dialect [#11327 [github.com]
>    
> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11327__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOdRsXqDS$>
>    ]
>
> With these changes, Iceberg views could be queried across different
> engines using their respective dialects.
> Please review and provide feedback.
>
>
> *Step 2: A long-term solution for full view interoperability. *For a
> sustainable solution, we need to consider both open-source and commercial
> engines (e.g., Snowflake, Dremio, BigQuery). A framework that can
> automatically translate SQL dialects from one engine to another seems ideal.
>
> I’ve explored Coral*[1]*, which works well with open-source engines like
> Hive, Trino, and Spark, but doesn’t fully cover commercial engines. In
> contrast, SQLGlot*[2]* seems to be a better fit, as it supports 23 SQL
> dialects and would provide a more robust solution for interoperability. But
> SQLGlot is a python project. So, we need REST or Py4J to work with SQLGlot.
>
> Additionally, I looked into Substrait*[3]* as a potential intermediate
> representation for views, but as per my understanding, integration with
> commercial engines isn’t quite ready yet. This technology has broader
> applications beyond views, including table planning, and may be worth
> revisiting in the future.
>
> Given this, I believe that incorporating a translation layer with SQLGlot
> is the right direction. If no one is currently working on view
> interoperability, I’d be happy to explore this further and take the lead. I
> welcome any thoughts or feedback from those involved in related projects.
>
> Looking forward to hearing your opinions.
>
> [1] https://github.com/linkedin/coral [github.com]
> <https://urldefense.com/v3/__https:/github.com/linkedin/coral__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOaU9qn72$>
> [2] https://sqlglot.com/sqlglot.html [sqlglot.com]
> <https://urldefense.com/v3/__https:/sqlglot.com/sqlglot.html__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOSbK5Ggx$>
> [3] https://substrait.io/ [substrait.io]
> <https://urldefense.com/v3/__https:/substrait.io/__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JORcivRrB$>
>
> - Ajantha
>
>

Re: [Discuss] Iceberg View Interoperability

Reply via email to