Re: [Discuss] Iceberg View Interoperability

Ajantha Bhat Fri, 29 Nov 2024 18:01:43 -0800

Hi Walaa, thanks for summarizing the questions.

> ** If there are interesting applications of introducing an IR in addition
> to dialects, should Iceberg adopt only one IR as the canonical "Iceberg
> IR", or should it be able to "represent IRs" in the same way it is able to
> "represent dialects"?*



IMO, having one standardized IR that is widely adopted can help here. If
it is not widely adopted, it cannot be considered standardized.

> ** What is the unique problem that is solved if Iceberg represents an IR as
> opposed to representing a SQL dialect? We can keep in mind the following
> when answering this question:*

Yes, we still need conversion if the engines cannot understand the IR directly.
IMO, converting an IR to a Calcite plan is more efficient than converting one
SQL dialect to each engine's SQL dialect.
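
One way to read that efficiency argument is as a count of translators to
build and maintain. The Python sketch below is only an illustration under
assumed names: the engine list and both helper functions are hypothetical and
do not come from Iceberg, Calcite, or Substrait.

```python
from itertools import permutations

# Illustrative engine list only; any set of dialects would do.
ENGINES = ["spark", "trino", "flink", "duckdb"]


def pairwise_translators(engines):
    """Direct dialect-to-dialect translation: one translator per ordered pair."""
    return list(permutations(engines, 2))


def ir_translators(engines):
    """Translation through one shared IR: each engine needs a writer
    (dialect -> IR) and a reader (IR -> dialect)."""
    writers = [(engine, "ir") for engine in engines]
    readers = [("ir", engine) for engine in engines]
    return writers + readers


print(len(pairwise_translators(ENGINES)))  # 12 direct translators
print(len(ir_translators(ENGINES)))        # 8 translators via the IR
```

With N engines, the direct approach needs N * (N - 1) translators, while the
shared-IR approach needs 2 * N, so the gap widens as more engines join.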

I have tagged Jacques on this thread. I am sure he can explain more about why
Substrait is a suitable choice here.

- Ajantha


On Fri, Nov 29, 2024 at 2:30 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Ajantha,
>
> I do not clearly see a consensus in this thread. If anything, I see this
> thread posing more questions than answers. Here is the collection of
> questions I could distill from the thread:
>
> ** What is the unique problem that is solved if Iceberg represents an IR
> as opposed to representing a SQL dialect? We can keep in mind the following
> when answering this question:*
>   ** An IR is a form of a dialect. Dialect is in text form. IR is in
> structured form.
>   ** Engines typically use Dialect as their first class citizen. So
> interoperability is typically between SQL dialects. (IR helps, but not
> necessarily through "storing" it).
>   ** Both dialect and IR conversion require translation.
>   ** Both dialect and IR can be fully specified. For example, the SQL
> Standard is based on some form of a SQL dialect, not a structured IR.
>
> ** If there are interesting applications of introducing an IR in addition
> to dialects, should Iceberg adopt only one IR as the canonical "Iceberg
> IR", or should it be able to "represent IRs" in the same way it is able to
> "represent dialects"?*
>
> ** If the answer is to adopt a single IR, what are the framework/criteria
> to design or choose that IR?*
>   ** Is it serializability, expressibility, or translatability?
>   ** How do we score the IRs against these criteria?
>
> ** If the answer is to support representing multiple IRs, the type of
> problems Iceberg would be concerned with will be different. We may have to
> think about different types of questions in this case.*
>
> Thanks,
> Walaa.
>
>
> On Mon, Nov 4, 2024 at 8:40 AM Matt Topol <zotthewiz...@gmail.com> wrote:
>
>> For reference, there are two reasons why I chose to add that substrait.go:
>>
>> 1) The Golang Arrow implementation has a compute package which is able to
>> evaluate substrait expressions as long as the kernels exist in the package.
>>
>> 2) Along the lines of this conversation, I wanted to be able to
>> generically create Substrait expressions from iceberg expressions. With the
>> goal being that the go implementation could potentially be able to create a
>> full substrait plan (including the reading) from an iceberg table (and
>> metadata) and expression. Eventually the plan would be able to be sent to a
>> compute engine which wouldn't have to know anything about iceberg to
>> execute it!
>>
>> On Mon, Nov 4, 2024, 5:34 PM Fokko Driesprong <fo...@apache.org> wrote:
>>
>>> Matt also just added `substrait.go` to the Iceberg-Go implementation
>>> that I was reviewing today:
>>>
>>> https://github.com/apache/iceberg-go/pull/185/files#diff-81cfac9f2e1dcf6265c569d0a3397964f0b78e07f45bb9dcdd3effe0623aaf73
>>>
>>> That converts an Iceberg expression into a Substrait one, pretty
>>> exciting stuff
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Mon, 4 Nov 2024 at 14:03, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Ajantha,
>>>>
>>>> During CommunityOverCode, I chatted with Matt Topol about Substrait and
>>>> ADBC.
>>>> I checked the Substrait support in DataFusion and it's interesting.
>>>>
>>>> I was thinking about where to actually store the Substrait plan (I was
>>>> thinking about an intermediate SQL representation that we could store
>>>> as metadata instead of storing the plan directly).
>>>>
>>>> Maybe we could start with a proposal document to explore the
>>>> different options (and so follow the Iceberg proposal process: creating a
>>>> GitHub issue with the proposal tag and attaching the document)?
>>>>
>>>> Thanks !
>>>> Regards
>>>> JB
>>>>
>>>> On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Thanks everyone for the detailed discussions.
>>>> >
>>>> > Looks like we have consensus towards Substrait.
>>>> > Last time I checked it was not adopted by all the engines. But we can
>>>> work towards the adoption as well.
>>>> >
>>>> > I will explore Substrait further and come up with a design doc
>>>> for it.
>>>> >
>>>> > Thanks,
>>>> > Ajantha
>>>> >
>>>> > On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hey all,
>>>> >>
>>>> >> I'm +1 on efforts to make views more interoperable across engines, as
>>>> I believe such efforts would be beneficial for the wider ecosystem. I think
>>>> the way to do that is through higher-fidelity IRs such as Substrait.
>>>> >>
>>>> >> I agree with Walaa that there's not really a valid distinction
>>>> between IR vs non-IR projects when it comes to translation; my
>>>> understanding is that in the end any translation framework would have to
>>>> normalize to an intermediate representation. In the SQLGlot case, it's
>>>> just that the IR is at the AST level, whereas the others have higher
>>>> fidelity and capture query semantics more accurately (correct me if I'm wrong
>>>> here). As of today, it is already possible to use SQLGlot, translate to the
>>>> desired SQL and store these SQL representations. However, since it's not as
>>>> high fidelity as a proper IR layer, there are issues to consider like Fokko
>>>> mentioned; but again, if users are happy with their results, they can do
>>>> this today without any spec changes.
>>>> >>
>>>> >> In my opinion, the biggest hurdle for Substrait or any other IR to
>>>> be a viable standard in Iceberg that's worth maintaining is that there
>>>> would need to be consensus across different engine/language communities
>>>> (e.g. Walaa referenced the Trino community's perspective on such IR
>>>> layers). Otherwise it risks becoming something that's defined in the
>>>> standard but really isn't well accepted which I think we all want to avoid.
>>>> >>
>>>> >> I think as a starting point, it would be great to sync with at least
>>>> OSS engines/language communities and try and understand any concrete points
>>>> of skepticism for considering such a standard. So far, a lot of the points
>>>> of skepticism, as I read them, are around such a layer considering only
>>>> one engine or having such substantial feature gaps that it can't be
>>>> considered; but no concrete cases have been called out.
>>>> >> Once we establish concrete gaps, I think then it would make sense to
>>>> work with the respective IR community to help close those gaps or if needed
>>>> consider other paths.
>>>> >>
>>>> >> Thanks,
>>>> >> Amogh Jahagirdar
>>>> >>
>>>> >> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <
>>>> piotr.findei...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I have no experience with Substrait, but I agree it looks like the
>>>> tool for the job.
>>>> >>> Or, as I proposed earlier, we define our own Iceberg IR for the
>>>> views.
>>>> >>>
>>>> >>> We can experiment with serialized IR being stored as a String with
>>>> new dialect name, and this is how we should get this started.
>>>> >>> It's probably a good end solution as well, but the important
>>>> value-add is if we manage to converge towards one shared IR that's "native
>>>> to Iceberg".
>>>> >>> This would be best for the users -- more views would just work.
>>>> >>> And best for long-term evolution of the project -- standardized IR
>>>> would help things like incremental refreshes (for materialized views).
>>>> >>>
>>>> >>> Best
>>>> >>> Piotr
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Hi Fokko,
>>>> >>>>
>>>> >>>> We can implement Python/Rust/Go clients to interop with the
>>>> serialized Coral IR. Not sure if it makes sense to have all front-end and
>>>> back-end implementations (e.g., Spark to Coral IR or Coral IR to Trino,
>>>> etc) be reimplemented in those languages. Such implementations actually
>>>> depend on the reuse of the native parsers of those dialects which are
>>>> typically in Java (also this is to your point about the language coverage
>>>> -- reusing native parsers is a principle that Coral follows to be compliant
>>>> with the source dialect). I think making Python/Rust/Go interop/handle the
>>>> IR (i.e., convert the serialized IR to in-memory IR and the other way
>>>> around) would be a good start. For example, Python-specific backends and
>>>> front-end implementations can follow from that.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Walaa.
>>>> >>>>
>>>> >>>>
>>>> >>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> Hey everyone,
>>>> >>>>>
>>>> >>>>> Views in PyIceberg are not yet as mature as in Java, mostly
>>>> because tooling in Python tends to work with data frames, rather than SQL.
>>>> I do think it would be valuable to extend support there.
>>>> >>>>>
>>>> >>>>> I have a bit of experience in turning SQL into ASTs and extending
>>>> grammars, and I'm confident in saying that it is nearly impossible to cover
>>>> all the grammar of a specific dialect. My main question is: what will SQLGlot
>>>> do when we try to translate a dialect that it doesn't fully understand?
>>>> Will it error out, or will it produce faulty SQL? Examples range from
>>>> functions that are not supported in other engines up to recursive CTEs. In
>>>> these cases, not failing upfront would cause correctness issues.
>>>> >>>>>
>>>> >>>>> Regarding Substrait: within PyIceberg there was also a successful
>>>> experiment in which a DuckDB query was sent to PyIceberg for the
>>>> Iceberg query planning, and a physical plan was returned to DuckDB for the
>>>> actual execution. This was still at an early stage and required a lot of work
>>>> around credentials and field IDs, but it was quite promising. Using
>>>> Substrait for views looks easier to me, and would also translate to a
>>>> dataframe-based world. Walaa, do you have any outlook on Coral
>>>> Python/Rust/Go support?
>>>> >>>>>
>>>> >>>>> Kind regards,
>>>> >>>>> Fokko
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Fri, 25 Oct 2024 at 22:16, Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> I think this may need some more discussion.
>>>> >>>>>>
>>>> >>>>>> To me, a "serialized IR" is another form of a "dialect". In this
>>>> case, this dialect will be mostly specific to Iceberg, and compute engines
>>>> will still support reading views in their native SQL. There are some data
>>>> points on this from the Trino community in a previous discussion [1]. In
>>>> addition to not being directly consumable by engines, a serialized IR will
>>>> be hard for humans to consume too.
>>>> >>>>>>
>>>> >>>>>> From that perspective, even if Iceberg adopts some form of a
>>>> serialized IR, we will end up again doing translation: from that IR to the
>>>> engine's dialect at view read time, and from the engine's dialect to that
>>>> IR at view write time. So a serialized IR cannot eliminate translation.
>>>> >>>>>>
>>>> >>>>>> I think it is better to not quickly adopt the serialized IR path
>>>> until it is proven to work and there is sufficient tooling and support
>>>> around it, else it will end up being a constraint.
>>>> >>>>>>
>>>> >>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are
>>>> some fundamental differences between their approaches, mainly around the
>>>> intermediate representation abstraction. Coral models both the AST and the
>>>> logical plan of a query, making it able to capture the query semantics more
>>>> accurately and hence perform precise transformations. On the flip side,
>>>> SQLGlot's abstraction is at the AST level only. For example, data type
>>>> inference would be a major gap in any solution that does not capture the
>>>> logical plan, yet it is very important for successful translation. This is
>>>> backed up by some experiments we performed on actual queries and their
>>>> translation results (from Spark to Trino, comparing results of Coral and
>>>> SQLGlot).
>>>> >>>>>>
>>>> >>>>>> For the IR: Any translation solution (including Coral) must rely
>>>> on an IR, and it has to be decoupled from any of the input and output
>>>> dialects. This is true in the Coral case today. Such an IR is the way to
>>>> represent both the intermediate AST and logical plans. Therefore, I do not
>>>> think we can necessarily split projects as "IR projects" vs not, since all
>>>> solutions must use an IR. With that said, IR serialization is a matter of
>>>> staging/milestones of the project. Serialized IR is next on Coral's
>>>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make
>>>> Iceberg interoperable with a Coral-based serialized IR. This will make the
>>>> compatibility with engines that adopt Coral (like Trino) much more robust
>>>> and straightforward.
>>>> >>>>>>
>>>> >>>>>> [1]
>>>> https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Walaa.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>>
>>>
