Re: [Discuss] Iceberg View Interoperability

Szehon Ho Fri, 25 Oct 2024 12:31:27 -0700

+1 on this, should we add an IR field / type to View
<https://iceberg.apache.org/view-spec/> spec
<https://iceberg.apache.org/view-spec/> then?  Sorry if its discussed
already, catching up a bit on the context.


Thanks
Szehon

On Fri, Oct 25, 2024 at 11:51 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> Substrait is one of the reasons why we designed views with the ability to
> have different representations. I think that SQL translation is not a great
> solution. I'd like to see more focus on a portable intermediate
> representation like Substrait. That would solve a lot of the limitations
> with the SQL representation (like not wanting to rewrite after resolution).
>
> On Fri, Oct 25, 2024 at 10:21 AM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
>
>> Im dont have hands on experience on Substrait, but wondering, is
>> substrait representation possible today with existing Iceberg view spec?
>> Ie, engines can store today the text serialized substrait representation
>> with sql dialect 'substrait'?   Or is it an abuse of spec and we should
>> make a proper field for IR representations?
>>
>> I imagine if its possible, interested engines can start working on
>> translating to and from substrait, and hopefully it will catch on, but as
>> of now Im not sure that Iceberg even provides a standard option?
>>
>> Thoughts?
>> Szehon
>>
>> On Tue, Oct 22, 2024 at 4:05 AM Will Raschkowski
>> <wraschkow...@palantir.com.invalid> wrote:
>>
>>> It’s all nascent, but we have a relatively solid experience going from
>>> Spark to Substrait [1], Substrait to Calcite [2], and Calcite to one of its
>>> supported SQL dialects [3]. All these translations are Java-based.
>>>
>>>
>>> [1]: https://github.com/substrait-io/substrait-java/pull/271
>>>
>>> [2]:
>>> https://github.com/substrait-io/substrait-java/blob/main/isthmus/README.md
>>>
>>> [3]:
>>> https://svn.apache.org/repos/asf/calcite/site/apidocs/org/apache/calcite/sql/dialect/package-summary.html
>>>
>>>
>>>
>>> *From: *Ajantha Bhat <ajanthab...@gmail.com>
>>> *Date: *Tuesday, 22 October 2024 at 08:22
>>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>>> *Subject: *Re: [Discuss] Iceberg View Interoperability
>>>
>>> *CAUTION:* This email originates from an external party (outside of
>>> Palantir). If you believe this message is suspicious in nature, please use
>>> the "Report Message" button built into Outlook.
>>>
>>>
>>>
>>> Thanks Dan for the reply.
>>>
>>> but I'm not convinced using a procedure is a good idea or really moves
>>> things forward in that direction.
>>> I just feel like the procedure approach has a number of drawbacks
>>> including, relying on the user to do the translation, being dependent on
>>> Spark for defining the other representations (the API would typically be
>>> available to you if you're using spark), there are a number of
>>> quoting/escaping issues that make it awkward to define, and it introduces a
>>> second way to define a view.
>>>
>>>
>>> I didn't consider this as a problem as the procedure is not exposing any
>>> new functionalities, users can still achieve this using java API and end up
>>> with the same situations mentioned above. Also, not just Spark, each engine
>>> can have this procedure.
>>> But I see what you mean, having a centralized solution independent of
>>> the engine can be better here.
>>>
>>> I feel like PyIceberg is the right path forward here.
>>>
>>>
>>>
>>> If we are using the python based solutions like SQLGlot, it makes sense
>>> to use the PyIceberg to expose a view API to translate and add additional
>>> dialects from it. But SQLGlot also doesn't support every engine out there
>>> (Like Flink, Dremio, Impala etc). So, we may still need an API to manually
>>> add the dialect for unsupported engines from python.
>>>
>>> I am sure we have people here who work on Coral, SQLGlot, Substrait and
>>> similar technologies in this mailing list. I will wait for a week to see if
>>> there are any opinions on the same. If not, we can start the development
>>> with SQLGlot.
>>>
>>> - Ajantha
>>>
>>>
>>>
>>> On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>> Hey Ajantha,
>>>
>>>
>>>
>>> I think it's good to figure out a path forward for extending view
>>> support, but I'm not convinced using a procedure is a good idea or really
>>> moves things forward in that direction.
>>>
>>>
>>>
>>> As you already indicated, there are a number of different libraries to
>>> translate views, but of the various options, It feels like there are two
>>> approaches that resonate with me are either letting the engine perform the
>>> translation (like Trino has done with Coral) or supporting views in
>>> python [github.com]
>>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg-python/issues/818__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JObc743Gc$>
>>> so that you can use a combination of libraries like SQLGlot.  Python is the
>>> most universally available and is already used in a lot of tooling for
>>> running/managing DDL statements.
>>>
>>>
>>>
>>> I just feel like the procedure approach has a number of drawbacks
>>> including, relying on the user to do the translation, being dependent on
>>> Spark for defining the other representations (the API would typically be
>>> available to you if you're using spark), there are a number of
>>> quoting/escaping issues that make it awkward to define, and it introduces a
>>> second way to define a view.
>>>
>>>
>>>
>>> Long-term IR (e.g. substrate) is the ideal, but I have questions about
>>> whether we'll be able to achieve enough consensus to gain wide adoption.  I
>>> feel like PyIceberg is the right path forward here.
>>>
>>>
>>>
>>> -Dan
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> It’s been over six months since Iceberg views were introduced (in
>>> version 1.5.0), and while we’ve seen some benefits—such as versioning and
>>> cross-engine recognition—there’s still a critical gap in terms of true
>>> interoperability. Although views created by one engine are recognized by
>>> another, they currently cannot be queried across engines as one engine
>>> cannot understand the SQL dialect of another engine.
>>>
>>> To address this, I propose a two-step approach:
>>>
>>>
>>> *Step 1: A temporary solution to manually add SQL dialects. *It is
>>> already possible to do this using some java code but most of the view users
>>> interact via SQL, there’s currently no SQL-native method to manually add
>>> dialects from other engines into Iceberg views. I suggest adding a call
>>> procedure to enable this. I've recently worked on this and have submitted
>>> the following PRs for review:
>>>
>>>    - Spark-3.5: Refactor BaseProcedure to support views [#11326
>>>    [github.com]
>>>    
>>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11326__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOcbrMwYs$>
>>>    ]
>>>    - Spark-3.5: Procedure to add view dialect [#11327 [github.com]
>>>    
>>> <https://urldefense.com/v3/__https:/github.com/apache/iceberg/pull/11327__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOdRsXqDS$>
>>>    ]
>>>
>>> With these changes, Iceberg views could be queried across different
>>> engines using their respective dialects.
>>> Please review and provide feedback.
>>>
>>>
>>> *Step 2: A long-term solution for full view interoperability. *For a
>>> sustainable solution, we need to consider both open-source and commercial
>>> engines (e.g., Snowflake, Dremio, BigQuery). A framework that can
>>> automatically translate SQL dialects from one engine to another seems ideal.
>>>
>>> I’ve explored Coral*[1]*, which works well with open-source engines
>>> like Hive, Trino, and Spark, but doesn’t fully cover commercial engines. In
>>> contrast, SQLGlot*[2]* seems to be a better fit, as it supports 23 SQL
>>> dialects and would provide a more robust solution for interoperability. But
>>> SQLGlot is a python project. So, we need REST or Py4J to work with SQLGlot.
>>>
>>> Additionally, I looked into Substrait*[3]* as a potential intermediate
>>> representation for views, but as per my understanding, integration with
>>> commercial engines isn’t quite ready yet. This technology has broader
>>> applications beyond views, including table planning, and may be worth
>>> revisiting in the future.
>>>
>>> Given this, I believe that incorporating a translation layer with
>>> SQLGlot is the right direction. If no one is currently working on view
>>> interoperability, I’d be happy to explore this further and take the lead. I
>>> welcome any thoughts or feedback from those involved in related projects.
>>>
>>> Looking forward to hearing your opinions.
>>>
>>> [1] https://github.com/linkedin/coral [github.com]
>>> <https://urldefense.com/v3/__https:/github.com/linkedin/coral__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOaU9qn72$>
>>> [2] https://sqlglot.com/sqlglot.html [sqlglot.com]
>>> <https://urldefense.com/v3/__https:/sqlglot.com/sqlglot.html__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JOSbK5Ggx$>
>>> [3] https://substrait.io/ [substrait.io]
>>> <https://urldefense.com/v3/__https:/substrait.io/__;!!NkS9JGVQ2sDq!8bASoPc-OIE9M3O3F6FPExuXHCZjdf6v32IN8GkYLTwwfEGmZcpYq8QAmmFAE0tfLgtOJDXpv7C7gQ-JORcivRrB$>
>>>
>>> - Ajantha
>>>
>>>

Re: [Discuss] Iceberg View Interoperability

Reply via email to