Re: [Discuss] Iceberg View Interoperability

rdb...@gmail.com Fri, 25 Oct 2024 11:51:15 -0700

Substrait is one of the reasons why we designed views with the ability to
have different representations. I think that SQL translation is not a great
solution. I'd like to see more focus on a portable intermediate
representation like Substrait. That would solve a lot of the limitations
with the SQL representation (like not wanting to rewrite after resolution).


On Fri, Oct 25, 2024 at 10:21 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Im dont have hands on experience on Substrait, but wondering, is substrait
> representation possible today with existing Iceberg view spec?  Ie, engines
> can store today the text serialized substrait representation with sql
> dialect 'substrait'?   Or is it an abuse of spec and we should make a
> proper field for IR representations?
>
> I imagine if its possible, interested engines can start working on
> translating to and from substrait, and hopefully it will catch on, but as
> of now Im not sure that Iceberg even provides a standard option?
>
> Thoughts?
> Szehon
>
> On Tue, Oct 22, 2024 at 4:05 AM Will Raschkowski
> <wraschkow...@palantir.com.invalid> wrote:
>
>> It’s all nascent, but we have a relatively solid experience going from
>> Spark to Substrait [1], Substrait to Calcite [2], and Calcite to one of its
>> supported SQL dialects [3]. All these translations are Java-based.
>>
>>
>> [1]: https://github.com/substrait-io/substrait-java/pull/271
>>
>> [2]:
>> https://github.com/substrait-io/substrait-java/blob/main/isthmus/README.md
>>
>> [3]:
>> https://svn.apache.org/repos/asf/calcite/site/apidocs/org/apache/calcite/sql/dialect/package-summary.html
>>
>>
>>
>> *From: *Ajantha Bhat <ajanthab...@gmail.com>
>> *Date: *Tuesday, 22 October 2024 at 08:22
>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Subject: *Re: [Discuss] Iceberg View Interoperability
>>
>> *CAUTION:* This email originates from an external party (outside of
>> Palantir). If you believe this message is suspicious in nature, please use
>> the "Report Message" button built into Outlook.
>>
>>
>>
>> Thanks Dan for the reply.
>>
>> but I'm not convinced using a procedure is a good idea or really moves
>> things forward in that direction.
>> I just feel like the procedure approach has a number of drawbacks
>> including, relying on the user to do the translation, being dependent on
>> Spark for defining the other representations (the API would typically be
>> available to you if you're using spark), there are a number of
>> quoting/escaping issues that make it awkward to define, and it introduces a
>> second way to define a view.
>>
>>
>> I didn't consider this as a problem as the procedure is not exposing any
>> new functionalities, users can still achieve this using java API and end up
>> with the same situations mentioned above. Also, not just Spark, each engine
>> can have this procedure.
>> But I see what you mean, having a centralized solution independent of the
>> engine can be better here.
>>
>> I feel like PyIceberg is the right path forward here.
>>
>>
>>
>> If we are using the python based solutions like SQLGlot, it makes sense
>> to use the PyIceberg to expose a view API to translate and add additional
>> dialects from it. But SQLGlot also doesn't support every engine out there
>> (Like Flink, Dremio, Impala etc). So, we may still need an API to manually
>> add the dialect for unsupported engines from python.
>>
>> I am sure we have people here who work on Coral, SQLGlot, Substrait and
>> similar technologies in this mailing list. I will wait for a week to see if
>> there are any opinions on the same. If not, we can start the development
>> with SQLGlot.
>>
>> - Ajantha
>>
>>
>>
>> On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>> Hey Ajantha,
>>
>>
>>
>> I think it's good to figure out a path forward for extending view
>> support, but I'm not convinced using a procedure is a good idea or really
>> moves things forward in that direction.
>>
>>
>>
>> As you already indicated, there are a number of different libraries to
>> translate views, but of the various options, It feels like there are two
>> approaches that resonate with me are either letting the engine perform the
>> translation (like Trino has done with Coral) or supporting views in
>> python [github.com]
>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg-python/issues/818__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JObc743Gc$>
>> so that you can use a combination of libraries like SQLGlot.  Python is the
>> most universally available and is already used in a lot of tooling for
>> running/managing DDL statements.
>>
>>
>>
>> I just feel like the procedure approach has a number of drawbacks
>> including, relying on the user to do the translation, being dependent on
>> Spark for defining the other representations (the API would typically be
>> available to you if you're using spark), there are a number of
>> quoting/escaping issues that make it awkward to define, and it introduces a
>> second way to define a view.
>>
>>
>>
>> Long-term IR (e.g. substrate) is the ideal, but I have questions about
>> whether we'll be able to achieve enough consensus to gain wide adoption.  I
>> feel like PyIceberg is the right path forward here.
>>
>>
>>
>> -Dan
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>> Hi everyone,
>>
>> It’s been over six months since Iceberg views were introduced (in version
>> 1.5.0), and while we’ve seen some benefits—such as versioning and
>> cross-engine recognition—there’s still a critical gap in terms of true
>> interoperability. Although views created by one engine are recognized by
>> another, they currently cannot be queried across engines as one engine
>> cannot understand the SQL dialect of another engine.
>>
>> To address this, I propose a two-step approach:
>>
>>
>> *Step 1: A temporary solution to manually add SQL dialects. *It is
>> already possible to do this using some java code but most of the view users
>> interact via SQL, there’s currently no SQL-native method to manually add
>> dialects from other engines into Iceberg views. I suggest adding a call
>> procedure to enable this. I've recently worked on this and have submitted
>> the following PRs for review:
>>
>>    - Spark-3.5: Refactor BaseProcedure to support views [#11326
>>    [github.com]
>>    
>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11326__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOcbrMwYs$>
>>    ]
>>    - Spark-3.5: Procedure to add view dialect [#11327 [github.com]
>>    
>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11327__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOdRsXqDS$>
>>    ]
>>
>> With these changes, Iceberg views could be queried across different
>> engines using their respective dialects.
>> Please review and provide feedback.
>>
>>
>> *Step 2: A long-term solution for full view interoperability. *For a
>> sustainable solution, we need to consider both open-source and commercial
>> engines (e.g., Snowflake, Dremio, BigQuery). A framework that can
>> automatically translate SQL dialects from one engine to another seems ideal.
>>
>> I’ve explored Coral*[1]*, which works well with open-source engines like
>> Hive, Trino, and Spark, but doesn’t fully cover commercial engines. In
>> contrast, SQLGlot*[2]* seems to be a better fit, as it supports 23 SQL
>> dialects and would provide a more robust solution for interoperability. But
>> SQLGlot is a python project. So, we need REST or Py4J to work with SQLGlot.
>>
>> Additionally, I looked into Substrait*[3]* as a potential intermediate
>> representation for views, but as per my understanding, integration with
>> commercial engines isn’t quite ready yet. This technology has broader
>> applications beyond views, including table planning, and may be worth
>> revisiting in the future.
>>
>> Given this, I believe that incorporating a translation layer with SQLGlot
>> is the right direction. If no one is currently working on view
>> interoperability, I’d be happy to explore this further and take the lead. I
>> welcome any thoughts or feedback from those involved in related projects.
>>
>> Looking forward to hearing your opinions.
>>
>> [1] https://github.com/linkedin/coral [github.com]
>> <https://urldefense.com/v3/__https:/github.com/linkedin/coral__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOaU9qn72$>
>> [2] https://sqlglot.com/sqlglot.html [sqlglot.com]
>> <https://urldefense.com/v3/__https:/sqlglot.com/sqlglot.html__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOSbK5Ggx$>
>> [3] https://substrait.io/ [substrait.io]
>> <https://urldefense.com/v3/__https:/substrait.io/__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JORcivRrB$>
>>
>> - Ajantha
>>
>>

Re: [Discuss] Iceberg View Interoperability

Reply via email to