Re: [Discuss] Iceberg View Interoperability

Ajantha Bhat Tue, 22 Oct 2024 00:22:23 -0700

Thanks Dan for the reply.

but I'm not convinced using a procedure is a good idea or really moves
> things forward in that direction.
> I just feel like the procedure approach has a number of drawbacks
> including, relying on the user to do the translation, being dependent on
> Spark for defining the other representations (the API would typically be
> available to you if you're using spark), there are a number of
> quoting/escaping issues that make it awkward to define, and it introduces a
> second way to define a view.



I didn't consider this as a problem as the procedure is not exposing any
new functionalities, users can still achieve this using java API and end up
with the same situations mentioned above. Also, not just Spark, each engine
can have this procedure.
But I see what you mean, having a centralized solution independent of the
engine can be better here.

I feel like PyIceberg is the right path forward here.


If we are using the python based solutions like SQLGlot, it makes sense to
use the PyIceberg to expose a view API to translate and add additional
dialects from it. But SQLGlot also doesn't support every engine out there
(Like Flink, Dremio, Impala etc). So, we may still need an API to manually
add the dialect for unsupported engines from python.

I am sure we have people here who work on Coral, SQLGlot, Substrait and
similar technologies in this mailing list. I will wait for a week to see if
there are any opinions on the same. If not, we can start the development
with SQLGlot.

- Ajantha

On Thu, Oct 17, 2024 at 9:32 PM Daniel Weeks <[email protected]> wrote:

> Hey Ajantha,
>
> I think it's good to figure out a path forward for extending view support,
> but I'm not convinced using a procedure is a good idea or really moves
> things forward in that direction.
>
> As you already indicated, there are a number of different libraries to
> translate views, but of the various options, It feels like there are two
> approaches that resonate with me are either letting the engine perform the
> translation (like Trino has done with Coral) or supporting views in python
> <https://github.com/apache/iceberg-python/issues/818> so that you can use
> a combination of libraries like SQLGlot.  Python is the most universally
> available and is already used in a lot of tooling for running/managing DDL
> statements.
>
> I just feel like the procedure approach has a number of drawbacks
> including, relying on the user to do the translation, being dependent on
> Spark for defining the other representations (the API would typically be
> available to you if you're using spark), there are a number of
> quoting/escaping issues that make it awkward to define, and it introduces a
> second way to define a view.
>
> Long-term IR (e.g. substrate) is the ideal, but I have questions about
> whether we'll be able to achieve enough consensus to gain wide adoption.  I
> feel like PyIceberg is the right path forward here.
>
> -Dan
>
>
>
>
>
> On Thu, Oct 17, 2024 at 5:19 AM Ajantha Bhat <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> It’s been over six months since Iceberg views were introduced (in version
>> 1.5.0), and while we’ve seen some benefits—such as versioning and
>> cross-engine recognition—there’s still a critical gap in terms of true
>> interoperability. Although views created by one engine are recognized by
>> another, they currently cannot be queried across engines as one engine
>> cannot understand the SQL dialect of another engine.
>>
>> To address this, I propose a two-step approach:
>>
>>
>> *Step 1: A temporary solution to manually add SQL dialects.*It is
>> already possible to do this using some java code but most of the view users
>> interact via SQL, there’s currently no SQL-native method to manually add
>> dialects from other engines into Iceberg views. I suggest adding a call
>> procedure to enable this. I've recently worked on this and have submitted
>> the following PRs for review:
>>
>>    - Spark-3.5: Refactor BaseProcedure to support views [#11326
>>    <https://github.com/apache/iceberg/pull/11326>]
>>    - Spark-3.5: Procedure to add view dialect [#11327
>>    <https://github.com/apache/iceberg/pull/11327>]
>>
>> With these changes, Iceberg views could be queried across different
>> engines using their respective dialects.
>> Please review and provide feedback.
>>
>>
>> *Step 2: A long-term solution for full view interoperability.*For a
>> sustainable solution, we need to consider both open-source and commercial
>> engines (e.g., Snowflake, Dremio, BigQuery). A framework that can
>> automatically translate SQL dialects from one engine to another seems ideal.
>>
>> I’ve explored Coral*[1]*, which works well with open-source engines like
>> Hive, Trino, and Spark, but doesn’t fully cover commercial engines. In
>> contrast, SQLGlot*[2]* seems to be a better fit, as it supports 23 SQL
>> dialects and would provide a more robust solution for interoperability. But
>> SQLGlot is a python project. So, we need REST or Py4J to work with SQLGlot.
>>
>> Additionally, I looked into Substrait*[3]* as a potential intermediate
>> representation for views, but as per my understanding, integration with
>> commercial engines isn’t quite ready yet. This technology has broader
>> applications beyond views, including table planning, and may be worth
>> revisiting in the future.
>>
>> Given this, I believe that incorporating a translation layer with SQLGlot
>> is the right direction. If no one is currently working on view
>> interoperability, I’d be happy to explore this further and take the lead. I
>> welcome any thoughts or feedback from those involved in related projects.
>>
>> Looking forward to hearing your opinions.
>>
>> [1] https://github.com/linkedin/coral
>> [2] https://sqlglot.com/sqlglot.html
>> [3] https://substrait.io/
>>
>> - Ajantha
>>
>

Re: [Discuss] Iceberg View Interoperability

Reply via email to