Hey all, I'm +1 in efforts to make views more interoperable across engines as I believe such efforts would be beneficial for the wider ecosystem. I think the way to do that is through higher fidelity IRs such as Substrait.
I agree with Walaa that there's not really a valid distinction between IR vs non-IR projects when it comes to translation; my understanding is that in the end any translation framework would have to normalize to an intermediate representation. With the SQLGlot case, it's just that the IR is at the AST level and with the others they have higher fidelity to capture more accurate query semantics (correct me if I'm wrong here). As of today, it is already possible to use SQLGlot, translate to the desired SQL and store these SQL representations. However, since it's not as high fidelity as a proper IR layer, there are issues to consider like Fokko mentioned; but again, if users are happy with their results, they can do this today without any spec changes. In my opinion, the biggest hurdle for Substrait or any other IR to be a viable standard in Iceberg that's worth maintaining is that there would need to be consensus across different engine/language communities (e.g. Walaa referenced the Trino community's perspective on such IR layers). Otherwise it risks becoming something that's defined in the standard but really isn't well accepted which I think we all want to avoid. I think as a starting point, it would be great to sync with at least OSS engines/language communities and try and understand any concrete points of skepticism for considering such a standard. So far a lot of the points of skepticism as I read it are around such a layer being only considerate of 1 engine or having such substantial feature gaps that it can't be considered; but no concrete cases have been called out. Once we establish concrete gaps, I think then it would make sense to work with the respective IR community to help close those gaps or if needed consider other paths. Thanks, Amogh Jahagirdar On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote: > Hi, > > I have no experience with Substrait, but i agree it looks like the tool > for the job. > Or, as I proposed earlier, we define our own Iceberg IR for the views. > > We can experiment with serialized IR being stored as a String with new > dialect name, and this is how we should get this started. > It's probably good end solution as well, but the important value-add is if > we manage to converge towards one shared IR that's "native to iceberg". > This would be best for the users -- more views would just work. > And best for long-term evolution of the project -- standardized IR would > help things like incremental refreshes (for materialized views). > > Best > Piotr > > > > > > On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <wa.moust...@gmail.com> > wrote: > >> Hi Fokko, >> >> We can implement Python/Rust/Go clients to interop with the serialized >> Coral IR. Not sure if it makes sense to have all front-end and back-end >> implementations (e.g., Spark to Coral IR or Coral IR to Trino, etc) be >> reimplemented in those languages. Such implementations actually depend on >> the reuse of the native parsers of those dialects which are typically in >> Java (also this is to your point about the language coverage -- >> reusing native parsers is a principle that Coral follows to be compliant >> with the source dialect). I think making Python/Rust/Go interop/handle the >> IR (i.e., convert the serialized IR to in-memory IR and the other way >> around) would be a good start. For example, Python-specific backends and >> front-end implementations can follow from that. >> >> Thanks, >> Walaa. >> >> >> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> >> wrote: >> >>> Hey everyone, >>> >>> Views in PyIceberg are not yet as mature as in Java, mostly because >>> tooling in Python tends to work with data frames, rather than SQL. I do >>> think it would be valuable to extend support there. >>> >>> I have a bit of experience in turning SQL into ASTs and extending >>> grammar, and I'm confident to say that it is nearly impossible to cover all >>> the grammar of a specific dialect. My main question is, what will SQLGlot >>> do when we try to translate a dialect that it doesn't fully understand? >>> Will it error out, or will it produce faulty SQL? A simple example can be >>> functions that are not supported in other engines up to recursive CTE's. In >>> this case, not failing upfront would cause correctness issues. >>> >>> Regarding Substrait. Within PyIceberg there was also successful >>> experimentation of having a DuckDB query, sending it to PyIceberg to do the >>> Iceberg query planning, and returning a physical plan to DuckDB to do the >>> actual execution. This was still an early stage and required a lot of work >>> around credentials and field-IDs, but it was quite promising. Using >>> Substrait as views looks easier to me, and would also translate to a >>> dataframe-based world. Walaa, do you have any outlook on Coral >>> Python/Rust/Go support? >>> >>> Kind regards, >>> Fokko >>> >>> >>> Op vr 25 okt 2024 om 22:16 schreef Walaa Eldin Moustafa < >>> wa.moust...@gmail.com>: >>> >>>> I think this may need some more discussion. >>>> >>>> To me, a "serialized IR" is another form of a "dialect". In this case, >>>> this dialect will be mostly specific to Iceberg, and compute engines >>>> will still support reading views in their native SQL. There are some data >>>> points on this from the Trino community in a previous discussion [1]. In >>>> addition to being not directly consumable by engines, a serialized IR will >>>> be hard to consume by humans too. >>>> >>>> From that perspective, even if Iceberg adopts some form of a serialized >>>> IR, we will end up again doing translation, from that IR to the engine's >>>> dialect on view read time, and from the engine's dialect to that IR on the >>>> view write time. So serialized IR cannot eliminate translation. >>>> >>>> I think it is better to not quickly adopt the serialized IR path until >>>> it is proven to work and there is sufficient tooling and support around it, >>>> else it will end up being a constraint. >>>> >>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some >>>> fundamental differences between their approaches, mainly around the >>>> intermediate representation abstraction. Coral models both the AST and the >>>> logical plan of a query, making it able to capture the query semantics more >>>> accurately and hence perform precise transformations. On the flip side, >>>> SQLGlot abstraction is at the AST level only. Data type inference would be >>>> a major gap in any solution that does not capture the logical plan for >>>> example, yet very important to perform successful translation. This is >>>> backed up by some experiments we performed on actual queries and their >>>> translation results (from Spark to Trino, comparing results of Coral and >>>> SQLGlot). >>>> >>>> For the IR: Any translation solution (including Coral) must rely on an >>>> IR, and it has to be decoupled from any of the input and output dialects. >>>> This is true in the Coral case today. Such IR is the way to represent both >>>> the intermediate AST and logical plans. Therefore, I do not think we can >>>> necessarily split projects as "IR projects" vs not, since all solutions >>>> must use an IR. With that said, IR serialization is a matter of >>>> staging/milestones of the project. Serialized IR is next on Coral's >>>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make >>>> Iceberg interoperable with a Coral-based serialized IR. This will make the >>>> compatibility with engines that adopt Coral (like Trino) much more robust >>>> and straightforward. >>>> >>>> [1] https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002 >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> >>>> >>>>