Hi Ajantha,

During CommunityOverCode, I chatted with Matt Topol about Substrait and ADBC. I checked the Substrait support in DataFusion and it looks interesting.
I was thinking about where to actually store the Substrait plan (I was considering an intermediate SQL representation that we could store as metadata, instead of storing the plan directly).

Maybe we could start with a proposal document to explore the different options (following the Iceberg proposal process: creating a GitHub issue with the proposal tag and attaching the document)?

Thanks!
Regards
JB

On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
> Thanks everyone for the detailed discussions.
>
> Looks like we have consensus towards Substrait.
> Last time I checked it was not adopted by all the engines. But we can work
> towards the adoption as well.
>
> I will explore further on Substrait and come up with the design doc on the
> same.
>
> Thanks,
> Ajantha
>
> On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>
>> Hey all,
>>
>> I'm +1 in efforts to make views more interoperable across engines as I
>> believe such efforts would be beneficial for the wider ecosystem. I think
>> the way to do that is through higher fidelity IRs such as Substrait.
>>
>> I agree with Walaa that there's not really a valid distinction between IR vs
>> non-IR projects when it comes to translation; my understanding is that in
>> the end any translation framework would have to normalize to an intermediate
>> representation. With the SQLGlot case, it's just that the IR is at the AST
>> level and with the others they have higher fidelity to capture more accurate
>> query semantics (correct me if I'm wrong here). As of today, it is already
>> possible to use SQLGlot, translate to the desired SQL and store these SQL
>> representations. However, since it's not as high fidelity as a proper IR
>> layer, there are issues to consider like Fokko mentioned; but again, if
>> users are happy with their results, they can do this today without any spec
>> changes.
>>
>> In my opinion, the biggest hurdle for Substrait or any other IR to be a
>> viable standard in Iceberg that's worth maintaining is that there would need
>> to be consensus across different engine/language communities (e.g. Walaa
>> referenced the Trino community's perspective on such IR layers). Otherwise
>> it risks becoming something that's defined in the standard but really isn't
>> well accepted, which I think we all want to avoid.
>>
>> I think as a starting point, it would be great to sync with at least OSS
>> engines/language communities and try to understand any concrete points of
>> skepticism for considering such a standard. So far a lot of the points of
>> skepticism, as I read them, are around such a layer considering only one
>> engine or having such substantial feature gaps that it can't be considered;
>> but no concrete cases have been called out.
>> Once we establish concrete gaps, I think then it would make sense to work
>> with the respective IR community to help close those gaps or, if needed,
>> consider other paths.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <piotr.findei...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have no experience with Substrait, but I agree it looks like the tool for
>>> the job.
>>> Or, as I proposed earlier, we define our own Iceberg IR for the views.
>>>
>>> We can experiment with serialized IR being stored as a String with a new
>>> dialect name, and this is how we should get this started.
>>> It's probably a good end solution as well, but the important value-add is if
>>> we manage to converge towards one shared IR that's "native to Iceberg".
>>> This would be best for the users -- more views would just work.
>>> And best for long-term evolution of the project -- standardized IR would
>>> help things like incremental refreshes (for materialized views).
>>>
>>> Best
>>> Piotr
>>>
>>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <wa.moust...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Fokko,
>>>>
>>>> We can implement Python/Rust/Go clients to interop with the serialized
>>>> Coral IR. Not sure if it makes sense to have all front-end and back-end
>>>> implementations (e.g., Spark to Coral IR or Coral IR to Trino, etc.) be
>>>> reimplemented in those languages. Such implementations actually depend on
>>>> the reuse of the native parsers of those dialects, which are typically in
>>>> Java (also this is to your point about the language coverage -- reusing
>>>> native parsers is a principle that Coral follows to be compliant with the
>>>> source dialect). I think making Python/Rust/Go interop/handle the IR
>>>> (i.e., convert the serialized IR to in-memory IR and the other way around)
>>>> would be a good start. For example, Python-specific back-end and front-end
>>>> implementations can follow from that.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> Views in PyIceberg are not yet as mature as in Java, mostly because
>>>>> tooling in Python tends to work with data frames, rather than SQL. I do
>>>>> think it would be valuable to extend support there.
>>>>>
>>>>> I have a bit of experience in turning SQL into ASTs and extending
>>>>> grammar, and I'm confident in saying that it is nearly impossible to
>>>>> cover all the grammar of a specific dialect. My main question is, what
>>>>> will SQLGlot do when we try to translate a dialect that it doesn't fully
>>>>> understand? Will it error out, or will it produce faulty SQL? Examples
>>>>> range from functions that are not supported in other engines up to
>>>>> recursive CTEs. In this case, not failing upfront would cause
>>>>> correctness issues.
>>>>>
>>>>> Regarding Substrait: within PyIceberg there was also successful
>>>>> experimentation with having a DuckDB query, sending it to PyIceberg to do
>>>>> the Iceberg query planning, and returning a physical plan to DuckDB to do
>>>>> the actual execution. This was still at an early stage and required a lot
>>>>> of work around credentials and field-IDs, but it was quite promising.
>>>>> Using Substrait for views looks easier to me, and would also translate to
>>>>> a dataframe-based world. Walaa, do you have any outlook on Coral
>>>>> Python/Rust/Go support?
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> On Fri, Oct 25, 2024 at 22:16, Walaa Eldin Moustafa
>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>> I think this may need some more discussion.
>>>>>>
>>>>>> To me, a "serialized IR" is another form of a "dialect". In this case,
>>>>>> this dialect will be mostly specific to Iceberg, and compute engines
>>>>>> will still support reading views in their native SQL. There are some
>>>>>> data points on this from the Trino community in a previous discussion
>>>>>> [1]. In addition to being not directly consumable by engines, a
>>>>>> serialized IR will be hard to consume by humans too.
>>>>>>
>>>>>> From that perspective, even if Iceberg adopts some form of a serialized
>>>>>> IR, we will end up again doing translation, from that IR to the engine's
>>>>>> dialect at view read time, and from the engine's dialect to that IR at
>>>>>> view write time. So serialized IR cannot eliminate translation.
>>>>>>
>>>>>> I think it is better to not quickly adopt the serialized IR path until
>>>>>> it is proven to work and there is sufficient tooling and support around
>>>>>> it, else it will end up being a constraint.
>>>>>>
>>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some
>>>>>> fundamental differences between their approaches, mainly around the
>>>>>> intermediate representation abstraction. Coral models both the AST and
>>>>>> the logical plan of a query, making it able to capture the query
>>>>>> semantics more accurately and hence perform precise transformations. On
>>>>>> the flip side, the SQLGlot abstraction is at the AST level only. Data
>>>>>> type inference, for example, would be a major gap in any solution that
>>>>>> does not capture the logical plan, yet it is very important for
>>>>>> successful translation. This is backed up by some experiments we
>>>>>> performed on actual queries and their translation results (from Spark to
>>>>>> Trino, comparing results of Coral and SQLGlot).
>>>>>>
>>>>>> For the IR: Any translation solution (including Coral) must rely on an
>>>>>> IR, and it has to be decoupled from any of the input and output
>>>>>> dialects. This is true in the Coral case today. Such an IR is the way to
>>>>>> represent both the intermediate AST and logical plans. Therefore, I do
>>>>>> not think we can necessarily split projects as "IR projects" vs not,
>>>>>> since all solutions must use an IR. With that said, IR serialization is
>>>>>> a matter of staging/milestones of the project. Serialized IR is next on
>>>>>> Coral's roadmap. If Iceberg ends up adopting an IR, it might be a good
>>>>>> idea to make Iceberg interoperable with a Coral-based serialized IR.
>>>>>> This will make the compatibility with engines that adopt Coral (like
>>>>>> Trino) much more robust and straightforward.
>>>>>>
>>>>>> [1] https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.