Re: [Discuss] Iceberg View Interoperability

Jacques Nadeau Fri, 29 Nov 2024 19:17:34 -0800

Hey Ajantha, thanks for looping me in. This is a great conversation.

FYI, I'm a co-creator of Substrait so read this all with that in mind.


Substrait has a couple of key underpinnings that are worth noting:
1. It's a specification first and foremost (with tools to help work with
the specification). This is how the project started and continues to work.
Clear specifications are required to succeed at having multiple different
communities and languages work independently but be compatible. This is
inspired by work I've been a part of in Arrow, Iceberg and Parquet.
2. Efficient IR (aka bytecode) is about creating a representation that is
easy for computers to consume and manipulate (as opposed to humans).

On the first item, I think we've already reaped the benefits of this. A
good example is the independent implementations of Substrait in multiple
languages (e.g. Ibis in python, Duckdb in C and Datafusion in Rust,
Calcite/Isthmus in Java). And a fun recent paper by MSFT on their use of
Substrait as a standard plan representation across engines [1]. The success
of the specification is defined by having multiple implementations written
independently (and sometimes competitively). FWIW, I know some lament all
the different Parquet readers and writers written in the world. While there
are probably too many, that's actually a testament to the success of the
specification.

On the second item, I disagree that IR is "just another dialect". Substrait
is built and operates much more like LLVM IR Bytecode or JVM Bytecode.
Swift, Rust and C++ aren't in the same class as LLVM IR. Scala, Kotlin and
Java aren't in the same class as JVM Bytecode. A key characteristic of
formal IRs is deliberate design decisions on the representation for
simplicity, specification and consumption that represents well across a
range of systems. It's a key effort in the Substrait community. Good
concrete examples would be our work to define simple decisions like
disallowing implicit casts as well as more complex topics like how to
represent subqueries [2], specify output type derivation rules [3], and
producing engine independent function compliance tests [4], etc.

WRT to multiple IRs, I don't think there are any others like Substrait
right now. (e.g. (1) well-specified, (2) language agonistic, (3)
serializable, (4) domain specific, (5) system agnostic and (6) built for
"dumb" consumers). I think the most mature attempt prior to Substrait was
probably the GPORCA XML representation [5] (which was used across Hawq &
Greenplum). That project ultimately failed due to single-vendor ownership
and focus on advanced query optimization as opposed to interoperability.
(For what it's worth--when starting on Substrait I spoke to some of the
creators of that project to get feedback on what went well and what
didn't).

I agree with Walaa that any translation engine will need an intermediate
representation to do m=>n translations. In most translation systems, the ir
is something internal that was built over time (what I'd call a "lowercase
ir"). I believe sqlglot and Coral both exist in that category. Both systems
may one day move towards formalizing a representation for input and output.
I think the world would be better if they used an existing standard like
Substrait as opposed to introducing a new standard (it's a lot of work to
get right, let's do it just once as a community). The strengths of tools
like Coral and sqlglot are they are complete products. You can pick each up
and translate SQL today. The strength of Substrait isn't SQL translations,
it's being a clear serializable way to represent data processing
instructions (including SQL).

In many ways I'm reminded of engines that came before Arrow. Each engine
had an internal representation of data. Some were columnar. Some even
looked somewhat like Arrow. When Arrow arrived, they said "we already have
an intermediate representation for data" and they did. People would say
"I'm not going to change my internal representation of data". We never
expected they would. Rewriting an existing engine to a new native format is
near impossible. The goal was new engines would just use Arrow rather than
reinventing the wheel and existing systems would build high-efficiency
adapters to their internal representations. And that's what happened.
Datafusion and Velox didn't need to invent entirely new representations.
When they needed new features they just added them to Arrow (like
Stringview, thanks Meta folks!) I believe Substrait has much the same
opportunity [6]. When people are working on new translations systems, they
use Substrait because it makes things easier. Systems that already have
internal representations pre-Substrait will build adapters over time
(sounds like Coral already has some).

TL;DR, serialization of Iceberg views is exactly the kind of use case that
Substrait was built for and I hope it becomes the main way to get reliable
interoperability. The ecosystem around Substrait (including Coral via
Calcite & Isthmus) is key to turning the specified representation into
something that engines can easily produce/consume. My suggestion would be a
new field that is structured as opposed to overloading a SQL field. Until
Substrait is supported across systems many systems will continue to do what
I'd call "hopeful translation" where they either just run the query (and
hope it works) or run it through a transpiler and hope to get the same
result (type rules be damned). Making it be a hard choice between Substrait
or SQL creates a big impediment to getting from here to where we all want
(reliable consistent views across engines). So I'd say store the SQL and
dialect like now, just start saving the Substrait too.


[1] https://arxiv.org/html/2411.13704v1
[2] https://github.com/substrait-io/substrait/pull/134/files
[3]
https://substrait.io/expressions/scalar_functions/#return-type-expressions
[4]
https://github.com/substrait-io/substrait/tree/main/tests/cases/arithmetic
[5] https://github.com/greenplum-db/gporca-archive
[6]
https://docs.google.com/presentation/d/1HQReIM6uB1Dli_yXfELOJWAE6KsvXAoUmHLlTYZ8laA/edit#slide=id.g150d13b05a9_0_471

On Fri, Nov 29, 2024 at 4:01 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi Walaa, thanks for summarizing the questions.
>
> ** If there are interesting applications of introducing an IR in addition
>> to dialects, should Iceberg adopt only one IR as the canonical "Iceberg
>> IR", or should it be able to "represent IRs" in the same way it is able to
>> "represent dialects"?*
>
>
> IMO, having one standardized IR which is widely adopted can help here. If
> it is not widely adopted. It cannot be considered as standardized.
>
> ** What is the unique problem that is solved if Iceberg represents an IR
>> as opposed to representing a SQL dialect? We can keep in mind the following
>> when answering this question:*
>
> Yes we still need conversion if the engines cannot understand IR directly.
> IMO, converting IR to a calcite plan is more efficient than converting SQL
> dialect to each engine's SQL dialect.
>
> I have tagged Jacques on this thread. I am sure he can explain more on why
> Substrait is the suitable choice here.
>
> - Ajantha
>
>
> On Fri, Nov 29, 2024 at 2:30 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Hi Ajantha,
>>
>> I do not clearly see a consensus in this thread. If anything, I see this
>> thread posing more questions than answers. Here is the collection of
>> questions I could distill from the thread:
>>
>> ** What is the unique problem that is solved if Iceberg represents an IR
>> as opposed to representing a SQL dialect? We can keep in mind the following
>> when answering this question:*
>>   ** An IR is a form of a dialect. Dialect is in text form. IR is in
>> structured form.
>>   ** Engines typically use Dialect as their first class citizen. So
>> interoperability is typically between SQL dialects. (IR helps, but not
>> necessarily through "storing" it).
>>   ** Both dialect and IR conversion require translation.
>>   ** Both dialect and IR can be fully specified. For example, the SQL
>> Standard is based on some form of a SQL dialect, not a structured IR.
>>
>> ** If there are interesting applications of introducing an IR in addition
>> to dialects, should Iceberg adopt only one IR as the canonical "Iceberg
>> IR", or should it be able to "represent IRs" in the same way it is able to
>> "represent dialects"?*
>>
>> ** If the answer is to adopt a single IR, what is the framework/criteria
>> to design or choose that IR?*
>>   ** Is it serializability, expressibility, or translatability?
>>   ** How do we score the IRs against this criteria?
>>
>> ** If the answer is to support representing multiple IRs, the type of
>> problems Iceberg would be concerned with will be different. We may have to
>> think about different types of questions in this case.*
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Mon, Nov 4, 2024 at 8:40 AM Matt Topol <zotthewiz...@gmail.com> wrote:
>>
>>> For reference, there are two reasons why I chose to add that
>>> substrait.go:
>>>
>>> 1) The Golang Arrow implementation has a compute package which is able
>>> to evaluate substrait expressions as long as the kernels exist in the
>>> package.
>>>
>>> 2) Along the lines of this conversation, I wanted to be able to
>>> generically create Substrait expressions from iceberg expressions. With the
>>> goal being that the go implementation could potentially be able to create a
>>> full substrait plan (including the reading) from an iceberg table (and
>>> metadata) and expression. Eventually the plan would be able to be sent to a
>>> compute engine which wouldn't have to know anything about iceberg to
>>> execute it!
>>>
>>> On Mon, Nov 4, 2024, 5:34 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>
>>>> Matt also just added `substrait.go` to the Iceberg-Go implementation
>>>> that I was reviewing today:
>>>>
>>>> https://github.com/apache/iceberg-go/pull/185/files#diff-81cfac9f2e1dcf6265c569d0a3397964f0b78e07f45bb9dcdd3effe0623aaf73
>>>>
>>>> That converts an Iceberg expression into a substrate one, pretty
>>>> exciting stuff
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> Op ma 4 nov 2024 om 14:03 schreef Jean-Baptiste Onofré <j...@nanthrax.net
>>>> >:
>>>>
>>>>> Hi Ajantha,
>>>>>
>>>>> During CommunityOverCode, I chatted with Matt Topol about Substrait
>>>>> and ADBC.
>>>>> I checked the Substrait support in DataFusion and it's interesting.
>>>>>
>>>>> I was thinking about where to actually store the Substrait plan (I was
>>>>> thinking about an intermediate SQL representation that we could store
>>>>> as a metadata instead of directly the plan).
>>>>>
>>>>> Maybe, we could start with a proposal document to explore the
>>>>> different options (and so follow Iceberg proposals process, creating a
>>>>> GitHub Issue with the proposal tag, and attaching the document) ?
>>>>>
>>>>> Thanks !
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Thanks everyone for the detailed discussions.
>>>>> >
>>>>> > Looks like we have consensus towards Substrait.
>>>>> > Last time I checked it was not adopted by all the engines. But we
>>>>> can work towards the adoption as well.
>>>>> >
>>>>> > I will explore further on Substrait and come up with the design doc
>>>>> on the same.
>>>>> >
>>>>> > Thanks,
>>>>> > Ajantha
>>>>> >
>>>>> > On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> Hey all,
>>>>> >>
>>>>> >> I'm +1 in efforts to make views more interoperable across engines
>>>>> as I believe such efforts would be beneficial for the wider ecosystem. I
>>>>> think the way to do that is through higher fidelity IRs such as Substrait.
>>>>> >>
>>>>> >> I agree with Walaa that there's not really a valid distinction
>>>>> between IR vs non-IR projects when it comes to translation; my
>>>>> understanding is that in the end any translation framework would have to
>>>>> normalize to an intermediate representation. With the SQLGlot case, it's
>>>>> just that the IR is at the AST level and with the others they have higher
>>>>> fidelity to capture more accurate query semantics (correct me if I'm wrong
>>>>> here). As of today, it is already possible to use SQLGlot, translate to 
>>>>> the
>>>>> desired SQL and store these SQL representations. However, since it's not 
>>>>> as
>>>>> high fidelity as a proper IR layer, there are issues to consider like 
>>>>> Fokko
>>>>> mentioned; but again, if users are happy with their results, they can do
>>>>> this today without any spec changes.
>>>>> >>
>>>>> >> In my opinion, the biggest hurdle for Substrait or any other IR to
>>>>> be a viable standard in Iceberg that's worth maintaining is that there
>>>>> would need to be consensus across different engine/language communities
>>>>> (e.g. Walaa referenced the Trino community's perspective on such IR
>>>>> layers). Otherwise it risks becoming something that's defined in the
>>>>> standard but really isn't well accepted which I think we all want to 
>>>>> avoid.
>>>>> >>
>>>>> >> I think as a starting point, it would be great to sync with at
>>>>> least OSS engines/language communities and try and understand any concrete
>>>>> points of skepticism for considering such a standard. So far a lot of the
>>>>> points of skepticism as I read it are around such a layer being only
>>>>> considerate of 1 engine or having such substantial feature gaps that it
>>>>> can't be considered; but no concrete cases have been called out.
>>>>> >> Once we establish concrete gaps, I think then it would make sense
>>>>> to work with the respective IR community to help close those gaps or if
>>>>> needed consider other paths.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Amogh Jahagirdar
>>>>> >>
>>>>> >> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <
>>>>> piotr.findei...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> Hi,
>>>>> >>>
>>>>> >>> I have no experience with Substrait, but i agree it looks like the
>>>>> tool for the job.
>>>>> >>> Or, as I proposed earlier, we define our own Iceberg IR for the
>>>>> views.
>>>>> >>>
>>>>> >>> We can experiment with serialized IR being stored as a String with
>>>>> new dialect name, and this is how we should get this started.
>>>>> >>> It's probably good end solution as well, but the important
>>>>> value-add is if we manage to converge towards one shared IR that's "native
>>>>> to iceberg".
>>>>> >>> This would be best for the users -- more views would just work.
>>>>> >>> And best for long-term evolution of the project -- standardized IR
>>>>> would help things like incremental refreshes (for materialized views).
>>>>> >>>
>>>>> >>> Best
>>>>> >>> Piotr
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Hi Fokko,
>>>>> >>>>
>>>>> >>>> We can implement Python/Rust/Go clients to interop with the
>>>>> serialized Coral IR. Not sure if it makes sense to have all front-end and
>>>>> back-end implementations (e.g., Spark to Coral IR or Coral IR to Trino,
>>>>> etc) be reimplemented in those languages. Such implementations actually
>>>>> depend on the reuse of the native parsers of those dialects which are
>>>>> typically in Java (also this is to your point about the language coverage
>>>>> -- reusing native parsers is a principle that Coral follows to be 
>>>>> compliant
>>>>> with the source dialect). I think making Python/Rust/Go interop/handle the
>>>>> IR (i.e., convert the serialized IR to in-memory IR and the other way
>>>>> around) would be a good start. For example, Python-specific backends and
>>>>> front-end implementations can follow from that.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> Walaa.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <
>>>>> fo...@apache.org> wrote:
>>>>> >>>>>
>>>>> >>>>> Hey everyone,
>>>>> >>>>>
>>>>> >>>>> Views in PyIceberg are not yet as mature as in Java, mostly
>>>>> because tooling in Python tends to work with data frames, rather than SQL.
>>>>> I do think it would be valuable to extend support there.
>>>>> >>>>>
>>>>> >>>>> I have a bit of experience in turning SQL into ASTs and
>>>>> extending grammar, and I'm confident to say that it is nearly impossible 
>>>>> to
>>>>> cover all the grammar of a specific dialect. My main question is, what 
>>>>> will
>>>>> SQLGlot do when we try to translate a dialect that it doesn't fully
>>>>> understand? Will it error out, or will it produce faulty SQL? A simple
>>>>> example can be functions that are not supported in other engines up to
>>>>> recursive CTE's. In this case, not failing upfront would cause correctness
>>>>> issues.
>>>>> >>>>>
>>>>> >>>>> Regarding Substrait. Within PyIceberg there was also successful
>>>>> experimentation of having a DuckDB query, sending it to PyIceberg to do 
>>>>> the
>>>>> Iceberg query planning, and returning a physical plan to DuckDB to do the
>>>>> actual execution. This was still an early stage and required a lot of work
>>>>> around credentials and field-IDs, but it was quite promising. Using
>>>>> Substrait as views looks easier to me, and would also translate to a
>>>>> dataframe-based world. Walaa, do you have any outlook on Coral
>>>>> Python/Rust/Go support?
>>>>> >>>>>
>>>>> >>>>> Kind regards,
>>>>> >>>>> Fokko
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> Op vr 25 okt 2024 om 22:16 schreef Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com>:
>>>>> >>>>>>
>>>>> >>>>>> I think this may need some more discussion.
>>>>> >>>>>>
>>>>> >>>>>> To me, a "serialized IR" is another form of a "dialect". In
>>>>> this case, this dialect will be mostly specific to Iceberg, and compute
>>>>> engines will still support reading views in their native SQL. There are
>>>>> some data points on this from the Trino community in a previous discussion
>>>>> [1]. In addition to being not directly consumable by engines, a serialized
>>>>> IR will be hard to consume by humans too.
>>>>> >>>>>>
>>>>> >>>>>> From that perspective, even if Iceberg adopts some form of a
>>>>> serialized IR, we will end up again doing translation, from that IR to the
>>>>> engine's dialect on view read time, and from the engine's dialect to that
>>>>> IR on the view write time. So serialized IR cannot eliminate translation.
>>>>> >>>>>>
>>>>> >>>>>> I think it is better to not quickly adopt the serialized IR
>>>>> path until it is proven to work and there is sufficient tooling and 
>>>>> support
>>>>> around it, else it will end up being a constraint.
>>>>> >>>>>>
>>>>> >>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are
>>>>> some fundamental differences between their approaches, mainly around the
>>>>> intermediate representation abstraction. Coral models both the AST and the
>>>>> logical plan of a query, making it able to capture the query semantics 
>>>>> more
>>>>> accurately and hence perform precise transformations. On the flip side,
>>>>> SQLGlot abstraction is at the AST level only. Data type inference would be
>>>>> a major gap in any solution that does not capture the logical plan for
>>>>> example, yet very important to perform successful translation. This is
>>>>> backed up by some experiments we performed on actual queries and their
>>>>> translation results (from Spark to Trino, comparing results of Coral and
>>>>> SQLGlot).
>>>>> >>>>>>
>>>>> >>>>>> For the IR: Any translation solution (including Coral) must
>>>>> rely on an IR, and it has to be decoupled from any of the input and output
>>>>> dialects. This is true in the Coral case today. Such IR is the way to
>>>>> represent both the intermediate AST and logical plans. Therefore, I do not
>>>>> think we can necessarily split projects as "IR projects" vs not, since all
>>>>> solutions must use an IR. With that said, IR serialization is a matter of
>>>>> staging/milestones of the project. Serialized IR is next on Coral's
>>>>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to 
>>>>> make
>>>>> Iceberg interoperable with a Coral-based serialized IR. This will make the
>>>>> compatibility with engines that adopt Coral (like Trino) much more robust
>>>>> and straightforward.
>>>>> >>>>>>
>>>>> >>>>>> [1]
>>>>> https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>>> >>>>>>
>>>>> >>>>>> Thanks,
>>>>> >>>>>> Walaa.
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>>
>>>>

Re: [Discuss] Iceberg View Interoperability

Reply via email to