Matt also just added `substrait.go` to the Iceberg-Go implementation that I was reviewing today: https://github.com/apache/iceberg-go/pull/185/files#diff-81cfac9f2e1dcf6265c569d0a3397964f0b78e07f45bb9dcdd3effe0623aaf73
That converts an Iceberg expression into a substrate one, pretty exciting stuff Kind regards, Fokko Op ma 4 nov 2024 om 14:03 schreef Jean-Baptiste Onofré <j...@nanthrax.net>: > Hi Ajantha, > > During CommunityOverCode, I chatted with Matt Topol about Substrait and > ADBC. > I checked the Substrait support in DataFusion and it's interesting. > > I was thinking about where to actually store the Substrait plan (I was > thinking about an intermediate SQL representation that we could store > as a metadata instead of directly the plan). > > Maybe, we could start with a proposal document to explore the > different options (and so follow Iceberg proposals process, creating a > GitHub Issue with the proposal tag, and attaching the document) ? > > Thanks ! > Regards > JB > > On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > > > > Thanks everyone for the detailed discussions. > > > > Looks like we have consensus towards Substrait. > > Last time I checked it was not adopted by all the engines. But we can > work towards the adoption as well. > > > > I will explore further on Substrait and come up with the design doc on > the same. > > > > Thanks, > > Ajantha > > > > On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com> > wrote: > >> > >> Hey all, > >> > >> I'm +1 in efforts to make views more interoperable across engines as I > believe such efforts would be beneficial for the wider ecosystem. I think > the way to do that is through higher fidelity IRs such as Substrait. > >> > >> I agree with Walaa that there's not really a valid distinction between > IR vs non-IR projects when it comes to translation; my understanding is > that in the end any translation framework would have to normalize to an > intermediate representation. With the SQLGlot case, it's just that the IR > is at the AST level and with the others they have higher fidelity to > capture more accurate query semantics (correct me if I'm wrong here). As of > today, it is already possible to use SQLGlot, translate to the desired SQL > and store these SQL representations. However, since it's not as high > fidelity as a proper IR layer, there are issues to consider like Fokko > mentioned; but again, if users are happy with their results, they can do > this today without any spec changes. > >> > >> In my opinion, the biggest hurdle for Substrait or any other IR to be a > viable standard in Iceberg that's worth maintaining is that there would > need to be consensus across different engine/language communities (e.g. > Walaa referenced the Trino community's perspective on such IR layers). > Otherwise it risks becoming something that's defined in the standard but > really isn't well accepted which I think we all want to avoid. > >> > >> I think as a starting point, it would be great to sync with at least > OSS engines/language communities and try and understand any concrete points > of skepticism for considering such a standard. So far a lot of the points > of skepticism as I read it are around such a layer being only considerate > of 1 engine or having such substantial feature gaps that it can't be > considered; but no concrete cases have been called out. > >> Once we establish concrete gaps, I think then it would make sense to > work with the respective IR community to help close those gaps or if needed > consider other paths. > >> > >> Thanks, > >> Amogh Jahagirdar > >> > >> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen < > piotr.findei...@gmail.com> wrote: > >>> > >>> Hi, > >>> > >>> I have no experience with Substrait, but i agree it looks like the > tool for the job. > >>> Or, as I proposed earlier, we define our own Iceberg IR for the views. > >>> > >>> We can experiment with serialized IR being stored as a String with new > dialect name, and this is how we should get this started. > >>> It's probably good end solution as well, but the important value-add > is if we manage to converge towards one shared IR that's "native to > iceberg". > >>> This would be best for the users -- more views would just work. > >>> And best for long-term evolution of the project -- standardized IR > would help things like incremental refreshes (for materialized views). > >>> > >>> Best > >>> Piotr > >>> > >>> > >>> > >>> > >>> > >>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >>>> > >>>> Hi Fokko, > >>>> > >>>> We can implement Python/Rust/Go clients to interop with the > serialized Coral IR. Not sure if it makes sense to have all front-end and > back-end implementations (e.g., Spark to Coral IR or Coral IR to Trino, > etc) be reimplemented in those languages. Such implementations actually > depend on the reuse of the native parsers of those dialects which are > typically in Java (also this is to your point about the language coverage > -- reusing native parsers is a principle that Coral follows to be compliant > with the source dialect). I think making Python/Rust/Go interop/handle the > IR (i.e., convert the serialized IR to in-memory IR and the other way > around) would be a good start. For example, Python-specific backends and > front-end implementations can follow from that. > >>>> > >>>> Thanks, > >>>> Walaa. > >>>> > >>>> > >>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> > wrote: > >>>>> > >>>>> Hey everyone, > >>>>> > >>>>> Views in PyIceberg are not yet as mature as in Java, mostly because > tooling in Python tends to work with data frames, rather than SQL. I do > think it would be valuable to extend support there. > >>>>> > >>>>> I have a bit of experience in turning SQL into ASTs and extending > grammar, and I'm confident to say that it is nearly impossible to cover all > the grammar of a specific dialect. My main question is, what will SQLGlot > do when we try to translate a dialect that it doesn't fully understand? > Will it error out, or will it produce faulty SQL? A simple example can be > functions that are not supported in other engines up to recursive CTE's. In > this case, not failing upfront would cause correctness issues. > >>>>> > >>>>> Regarding Substrait. Within PyIceberg there was also successful > experimentation of having a DuckDB query, sending it to PyIceberg to do the > Iceberg query planning, and returning a physical plan to DuckDB to do the > actual execution. This was still an early stage and required a lot of work > around credentials and field-IDs, but it was quite promising. Using > Substrait as views looks easier to me, and would also translate to a > dataframe-based world. Walaa, do you have any outlook on Coral > Python/Rust/Go support? > >>>>> > >>>>> Kind regards, > >>>>> Fokko > >>>>> > >>>>> > >>>>> Op vr 25 okt 2024 om 22:16 schreef Walaa Eldin Moustafa < > wa.moust...@gmail.com>: > >>>>>> > >>>>>> I think this may need some more discussion. > >>>>>> > >>>>>> To me, a "serialized IR" is another form of a "dialect". In this > case, this dialect will be mostly specific to Iceberg, and compute engines > will still support reading views in their native SQL. There are some data > points on this from the Trino community in a previous discussion [1]. In > addition to being not directly consumable by engines, a serialized IR will > be hard to consume by humans too. > >>>>>> > >>>>>> From that perspective, even if Iceberg adopts some form of a > serialized IR, we will end up again doing translation, from that IR to the > engine's dialect on view read time, and from the engine's dialect to that > IR on the view write time. So serialized IR cannot eliminate translation. > >>>>>> > >>>>>> I think it is better to not quickly adopt the serialized IR path > until it is proven to work and there is sufficient tooling and support > around it, else it will end up being a constraint. > >>>>>> > >>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some > fundamental differences between their approaches, mainly around the > intermediate representation abstraction. Coral models both the AST and the > logical plan of a query, making it able to capture the query semantics more > accurately and hence perform precise transformations. On the flip side, > SQLGlot abstraction is at the AST level only. Data type inference would be > a major gap in any solution that does not capture the logical plan for > example, yet very important to perform successful translation. This is > backed up by some experiments we performed on actual queries and their > translation results (from Spark to Trino, comparing results of Coral and > SQLGlot). > >>>>>> > >>>>>> For the IR: Any translation solution (including Coral) must rely on > an IR, and it has to be decoupled from any of the input and output > dialects. This is true in the Coral case today. Such IR is the way to > represent both the intermediate AST and logical plans. Therefore, I do not > think we can necessarily split projects as "IR projects" vs not, since all > solutions must use an IR. With that said, IR serialization is a matter of > staging/milestones of the project. Serialized IR is next on Coral's > roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make > Iceberg interoperable with a Coral-based serialized IR. This will make the > compatibility with engines that adopt Coral (like Trino) much more robust > and straightforward. > >>>>>> > >>>>>> [1] > https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002 > >>>>>> > >>>>>> Thanks, > >>>>>> Walaa. > >>>>>> > >>>>>> > >>>>>> >