Hi Walaa, thanks for summarizing the questions. ** If there are interesting applications of introducing an IR in addition > to dialects, should Iceberg adopt only one IR as the canonical "Iceberg > IR", or should it be able to "represent IRs" in the same way it is able to > "represent dialects"?*
IMO, having one standardized IR which is widely adopted can help here. If it is not widely adopted. It cannot be considered as standardized. ** What is the unique problem that is solved if Iceberg represents an IR as > opposed to representing a SQL dialect? We can keep in mind the following > when answering this question:* Yes we still need conversion if the engines cannot understand IR directly. IMO, converting IR to a calcite plan is more efficient than converting SQL dialect to each engine's SQL dialect. I have tagged Jacques on this thread. I am sure he can explain more on why Substrait is the suitable choice here. - Ajantha On Fri, Nov 29, 2024 at 2:30 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Hi Ajantha, > > I do not clearly see a consensus in this thread. If anything, I see this > thread posing more questions than answers. Here is the collection of > questions I could distill from the thread: > > ** What is the unique problem that is solved if Iceberg represents an IR > as opposed to representing a SQL dialect? We can keep in mind the following > when answering this question:* > ** An IR is a form of a dialect. Dialect is in text form. IR is in > structured form. > ** Engines typically use Dialect as their first class citizen. So > interoperability is typically between SQL dialects. (IR helps, but not > necessarily through "storing" it). > ** Both dialect and IR conversion require translation. > ** Both dialect and IR can be fully specified. For example, the SQL > Standard is based on some form of a SQL dialect, not a structured IR. > > ** If there are interesting applications of introducing an IR in addition > to dialects, should Iceberg adopt only one IR as the canonical "Iceberg > IR", or should it be able to "represent IRs" in the same way it is able to > "represent dialects"?* > > ** If the answer is to adopt a single IR, what is the framework/criteria > to design or choose that IR?* > ** Is it serializability, expressibility, or translatability? > ** How do we score the IRs against this criteria? > > ** If the answer is to support representing multiple IRs, the type of > problems Iceberg would be concerned with will be different. We may have to > think about different types of questions in this case.* > > Thanks, > Walaa. > > > On Mon, Nov 4, 2024 at 8:40 AM Matt Topol <zotthewiz...@gmail.com> wrote: > >> For reference, there are two reasons why I chose to add that substrait.go: >> >> 1) The Golang Arrow implementation has a compute package which is able to >> evaluate substrait expressions as long as the kernels exist in the package. >> >> 2) Along the lines of this conversation, I wanted to be able to >> generically create Substrait expressions from iceberg expressions. With the >> goal being that the go implementation could potentially be able to create a >> full substrait plan (including the reading) from an iceberg table (and >> metadata) and expression. Eventually the plan would be able to be sent to a >> compute engine which wouldn't have to know anything about iceberg to >> execute it! >> >> On Mon, Nov 4, 2024, 5:34 PM Fokko Driesprong <fo...@apache.org> wrote: >> >>> Matt also just added `substrait.go` to the Iceberg-Go implementation >>> that I was reviewing today: >>> >>> https://github.com/apache/iceberg-go/pull/185/files#diff-81cfac9f2e1dcf6265c569d0a3397964f0b78e07f45bb9dcdd3effe0623aaf73 >>> >>> That converts an Iceberg expression into a substrate one, pretty >>> exciting stuff >>> >>> Kind regards, >>> Fokko >>> >>> Op ma 4 nov 2024 om 14:03 schreef Jean-Baptiste Onofré <j...@nanthrax.net >>> >: >>> >>>> Hi Ajantha, >>>> >>>> During CommunityOverCode, I chatted with Matt Topol about Substrait and >>>> ADBC. >>>> I checked the Substrait support in DataFusion and it's interesting. >>>> >>>> I was thinking about where to actually store the Substrait plan (I was >>>> thinking about an intermediate SQL representation that we could store >>>> as a metadata instead of directly the plan). >>>> >>>> Maybe, we could start with a proposal document to explore the >>>> different options (and so follow Iceberg proposals process, creating a >>>> GitHub Issue with the proposal tag, and attaching the document) ? >>>> >>>> Thanks ! >>>> Regards >>>> JB >>>> >>>> On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> > >>>> > Thanks everyone for the detailed discussions. >>>> > >>>> > Looks like we have consensus towards Substrait. >>>> > Last time I checked it was not adopted by all the engines. But we can >>>> work towards the adoption as well. >>>> > >>>> > I will explore further on Substrait and come up with the design doc >>>> on the same. >>>> > >>>> > Thanks, >>>> > Ajantha >>>> > >>>> > On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com> >>>> wrote: >>>> >> >>>> >> Hey all, >>>> >> >>>> >> I'm +1 in efforts to make views more interoperable across engines as >>>> I believe such efforts would be beneficial for the wider ecosystem. I think >>>> the way to do that is through higher fidelity IRs such as Substrait. >>>> >> >>>> >> I agree with Walaa that there's not really a valid distinction >>>> between IR vs non-IR projects when it comes to translation; my >>>> understanding is that in the end any translation framework would have to >>>> normalize to an intermediate representation. With the SQLGlot case, it's >>>> just that the IR is at the AST level and with the others they have higher >>>> fidelity to capture more accurate query semantics (correct me if I'm wrong >>>> here). As of today, it is already possible to use SQLGlot, translate to the >>>> desired SQL and store these SQL representations. However, since it's not as >>>> high fidelity as a proper IR layer, there are issues to consider like Fokko >>>> mentioned; but again, if users are happy with their results, they can do >>>> this today without any spec changes. >>>> >> >>>> >> In my opinion, the biggest hurdle for Substrait or any other IR to >>>> be a viable standard in Iceberg that's worth maintaining is that there >>>> would need to be consensus across different engine/language communities >>>> (e.g. Walaa referenced the Trino community's perspective on such IR >>>> layers). Otherwise it risks becoming something that's defined in the >>>> standard but really isn't well accepted which I think we all want to avoid. >>>> >> >>>> >> I think as a starting point, it would be great to sync with at least >>>> OSS engines/language communities and try and understand any concrete points >>>> of skepticism for considering such a standard. So far a lot of the points >>>> of skepticism as I read it are around such a layer being only considerate >>>> of 1 engine or having such substantial feature gaps that it can't be >>>> considered; but no concrete cases have been called out. >>>> >> Once we establish concrete gaps, I think then it would make sense to >>>> work with the respective IR community to help close those gaps or if needed >>>> consider other paths. >>>> >> >>>> >> Thanks, >>>> >> Amogh Jahagirdar >>>> >> >>>> >> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen < >>>> piotr.findei...@gmail.com> wrote: >>>> >>> >>>> >>> Hi, >>>> >>> >>>> >>> I have no experience with Substrait, but i agree it looks like the >>>> tool for the job. >>>> >>> Or, as I proposed earlier, we define our own Iceberg IR for the >>>> views. >>>> >>> >>>> >>> We can experiment with serialized IR being stored as a String with >>>> new dialect name, and this is how we should get this started. >>>> >>> It's probably good end solution as well, but the important >>>> value-add is if we manage to converge towards one shared IR that's "native >>>> to iceberg". >>>> >>> This would be best for the users -- more views would just work. >>>> >>> And best for long-term evolution of the project -- standardized IR >>>> would help things like incremental refreshes (for materialized views). >>>> >>> >>>> >>> Best >>>> >>> Piotr >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>> >>>> >>>> Hi Fokko, >>>> >>>> >>>> >>>> We can implement Python/Rust/Go clients to interop with the >>>> serialized Coral IR. Not sure if it makes sense to have all front-end and >>>> back-end implementations (e.g., Spark to Coral IR or Coral IR to Trino, >>>> etc) be reimplemented in those languages. Such implementations actually >>>> depend on the reuse of the native parsers of those dialects which are >>>> typically in Java (also this is to your point about the language coverage >>>> -- reusing native parsers is a principle that Coral follows to be compliant >>>> with the source dialect). I think making Python/Rust/Go interop/handle the >>>> IR (i.e., convert the serialized IR to in-memory IR and the other way >>>> around) would be a good start. For example, Python-specific backends and >>>> front-end implementations can follow from that. >>>> >>>> >>>> >>>> Thanks, >>>> >>>> Walaa. >>>> >>>> >>>> >>>> >>>> >>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> >>>> wrote: >>>> >>>>> >>>> >>>>> Hey everyone, >>>> >>>>> >>>> >>>>> Views in PyIceberg are not yet as mature as in Java, mostly >>>> because tooling in Python tends to work with data frames, rather than SQL. >>>> I do think it would be valuable to extend support there. >>>> >>>>> >>>> >>>>> I have a bit of experience in turning SQL into ASTs and extending >>>> grammar, and I'm confident to say that it is nearly impossible to cover all >>>> the grammar of a specific dialect. My main question is, what will SQLGlot >>>> do when we try to translate a dialect that it doesn't fully understand? >>>> Will it error out, or will it produce faulty SQL? A simple example can be >>>> functions that are not supported in other engines up to recursive CTE's. In >>>> this case, not failing upfront would cause correctness issues. >>>> >>>>> >>>> >>>>> Regarding Substrait. Within PyIceberg there was also successful >>>> experimentation of having a DuckDB query, sending it to PyIceberg to do the >>>> Iceberg query planning, and returning a physical plan to DuckDB to do the >>>> actual execution. This was still an early stage and required a lot of work >>>> around credentials and field-IDs, but it was quite promising. Using >>>> Substrait as views looks easier to me, and would also translate to a >>>> dataframe-based world. Walaa, do you have any outlook on Coral >>>> Python/Rust/Go support? >>>> >>>>> >>>> >>>>> Kind regards, >>>> >>>>> Fokko >>>> >>>>> >>>> >>>>> >>>> >>>>> Op vr 25 okt 2024 om 22:16 schreef Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com>: >>>> >>>>>> >>>> >>>>>> I think this may need some more discussion. >>>> >>>>>> >>>> >>>>>> To me, a "serialized IR" is another form of a "dialect". In this >>>> case, this dialect will be mostly specific to Iceberg, and compute engines >>>> will still support reading views in their native SQL. There are some data >>>> points on this from the Trino community in a previous discussion [1]. In >>>> addition to being not directly consumable by engines, a serialized IR will >>>> be hard to consume by humans too. >>>> >>>>>> >>>> >>>>>> From that perspective, even if Iceberg adopts some form of a >>>> serialized IR, we will end up again doing translation, from that IR to the >>>> engine's dialect on view read time, and from the engine's dialect to that >>>> IR on the view write time. So serialized IR cannot eliminate translation. >>>> >>>>>> >>>> >>>>>> I think it is better to not quickly adopt the serialized IR path >>>> until it is proven to work and there is sufficient tooling and support >>>> around it, else it will end up being a constraint. >>>> >>>>>> >>>> >>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are >>>> some fundamental differences between their approaches, mainly around the >>>> intermediate representation abstraction. Coral models both the AST and the >>>> logical plan of a query, making it able to capture the query semantics more >>>> accurately and hence perform precise transformations. On the flip side, >>>> SQLGlot abstraction is at the AST level only. Data type inference would be >>>> a major gap in any solution that does not capture the logical plan for >>>> example, yet very important to perform successful translation. This is >>>> backed up by some experiments we performed on actual queries and their >>>> translation results (from Spark to Trino, comparing results of Coral and >>>> SQLGlot). >>>> >>>>>> >>>> >>>>>> For the IR: Any translation solution (including Coral) must rely >>>> on an IR, and it has to be decoupled from any of the input and output >>>> dialects. This is true in the Coral case today. Such IR is the way to >>>> represent both the intermediate AST and logical plans. Therefore, I do not >>>> think we can necessarily split projects as "IR projects" vs not, since all >>>> solutions must use an IR. With that said, IR serialization is a matter of >>>> staging/milestones of the project. Serialized IR is next on Coral's >>>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make >>>> Iceberg interoperable with a Coral-based serialized IR. This will make the >>>> compatibility with engines that adopt Coral (like Trino) much more robust >>>> and straightforward. >>>> >>>>>> >>>> >>>>>> [1] >>>> https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002 >>>> >>>>>> >>>> >>>>>> Thanks, >>>> >>>>>> Walaa. >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>