Re: [Discuss] Iceberg View Interoperability

Will Raschkowski Tue, 22 Oct 2024 04:06:49 -0700

It’s all nascent, but we have a relatively solid experience going from Spark to 
Substrait [1], Substrait to Calcite [2], and Calcite to one of its supported 
SQL dialects [3]. All these translations are Java-based.

[1]: https://github.com/substrait-io/substrait-java/pull/271
[2]: https://github.com/substrait-io/substrait-java/blob/main/isthmus/README.md
[3]: 
https://svn.apache.org/repos/asf/calcite/site/apidocs/org/apache/calcite/sql/dialect/package-summary.html

From: Ajantha Bhat <[email protected]>
Date: Tuesday, 22 October 2024 at 08:22
To: [email protected] <[email protected]>
Subject: Re: [Discuss] Iceberg View Interoperability
CAUTION: This email originates from an external party (outside of Palantir). If 
you believe this message is suspicious in nature, please use the "Report 
Message" button built into Outlook.

Thanks Dan for the reply.
but I'm not convinced using a procedure is a good idea or really moves things 
forward in that direction.
I just feel like the procedure approach has a number of drawbacks including, 
relying on the user to do the translation, being dependent on Spark for 
defining the other representations (the API would typically be available to you 
if you're using spark), there are a number of quoting/escaping issues that make 
it awkward to define, and it introduces a second way to define a view.

I didn't consider this as a problem as the procedure is not exposing any new 
functionalities, users can still achieve this using java API and end up with 
the same situations mentioned above. Also, not just Spark, each engine can have 
this procedure.
But I see what you mean, having a centralized solution independent of the 
engine can be better here.
I feel like PyIceberg is the right path forward here.

If we are using the python based solutions like SQLGlot, it makes sense to use 
the PyIceberg to expose a view API to translate and add additional dialects 
from it. But SQLGlot also doesn't support every engine out there (Like Flink, 
Dremio, Impala etc). So, we may still need an API to manually add the dialect 
for unsupported engines from python.

I am sure we have people here who work on Coral, SQLGlot, Substrait and similar 
technologies in this mailing list. I will wait for a week to see if there are 
any opinions on the same. If not, we can start the development with SQLGlot.

- Ajantha

On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks 
<[email protected]<mailto:[email protected]>> wrote:
Hey Ajantha,

I think it's good to figure out a path forward for extending view support, but 
I'm not convinced using a procedure is a good idea or really moves things 
forward in that direction.

As you already indicated, there are a number of different libraries to 
translate views, but of the various options, It feels like there are two 
approaches that resonate with me are either letting the engine perform the 
translation (like Trino has done with Coral) or supporting views in python 
[github.com]<https://urldefense.com/v3/__https:/github.com/apache/iceberg-python/issues/818__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JObc743Gc$>
 so that you can use a combination of libraries like SQLGlot.  Python is the 
most universally available and is already used in a lot of tooling for 
running/managing DDL statements.

I just feel like the procedure approach has a number of drawbacks including, 
relying on the user to do the translation, being dependent on Spark for 
defining the other representations (the API would typically be available to you 
if you're using spark), there are a number of quoting/escaping issues that make 
it awkward to define, and it introduces a second way to define a view.

Long-term IR (e.g. substrate) is the ideal, but I have questions about whether 
we'll be able to achieve enough consensus to gain wide adoption.  I feel like 
PyIceberg is the right path forward here.

-Dan

On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat 
<[email protected]<mailto:[email protected]>> wrote:
Hi everyone,

It’s been over six months since Iceberg views were introduced (in version 
1.5.0), and while we’ve seen some benefits—such as versioning and cross-engine 
recognition—there’s still a critical gap in terms of true interoperability. 
Although views created by one engine are recognized by another, they currently 
cannot be queried across engines as one engine cannot understand the SQL 
dialect of another engine.

To address this, I propose a two-step approach:

Step 1: A temporary solution to manually add SQL dialects.
It is already possible to do this using some java code but most of the view 
users interact via SQL, there’s currently no SQL-native method to manually add 
dialects from other engines into Iceberg views. I suggest adding a call 
procedure to enable this. I've recently worked on this and have submitted the 
following PRs for review:

  *   Spark-3.5: Refactor BaseProcedure to support views [#11326 
[github.com]<https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11326__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOcbrMwYs$>]
  *   Spark-3.5: Procedure to add view dialect [#11327 
[github.com]<https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11327__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOdRsXqDS$>]
With these changes, Iceberg views could be queried across different engines 
using their respective dialects.
Please review and provide feedback.

Step 2: A long-term solution for full view interoperability.
For a sustainable solution, we need to consider both open-source and commercial 
engines (e.g., Snowflake, Dremio, BigQuery). A framework that can automatically 
translate SQL dialects from one engine to another seems ideal.

I’ve explored Coral[1], which works well with open-source engines like Hive, 
Trino, and Spark, but doesn’t fully cover commercial engines. In contrast, 
SQLGlot[2] seems to be a better fit, as it supports 23 SQL dialects and would 
provide a more robust solution for interoperability. But SQLGlot is a python 
project. So, we need REST or Py4J to work with SQLGlot.

Additionally, I looked into Substrait[3] as a potential intermediate 
representation for views, but as per my understanding, integration with 
commercial engines isn’t quite ready yet. This technology has broader 
applications beyond views, including table planning, and may be worth 
revisiting in the future.

Given this, I believe that incorporating a translation layer with SQLGlot is 
the right direction. If no one is currently working on view interoperability, 
I’d be happy to explore this further and take the lead. I welcome any thoughts 
or feedback from those involved in related projects.

Looking forward to hearing your opinions.

[1] https://github.com/linkedin/coral 
[github.com]<https://urldefense.com/v3/__https:/github.com/linkedin/coral__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOaU9qn72$>
[2] https://sqlglot.com/sqlglot.html 
[sqlglot.com]<https://urldefense.com/v3/__https:/sqlglot.com/sqlglot.html__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOSbK5Ggx$>
[3] https://substrait.io/ 
[substrait.io]<https://urldefense.com/v3/__https:/substrait.io/__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JORcivRrB$>

- Ajantha

Re: [Discuss] Iceberg View Interoperability

Reply via email to