Hi,

I have no experience with Substrait, but I agree it looks like the tool for
the job.
Or, as I proposed earlier, we could define our own Iceberg IR for the views.

We can experiment with the serialized IR being stored as a string under a
new dialect name, and that is how we should get this started (see the
sketch below).
It's probably a good end solution as well, but the important value-add is
if we manage to converge towards one shared IR that's "native to Iceberg".
This would be best for the users -- more views would just work.
And best for the long-term evolution of the project -- a standardized IR
would help with things like incremental refresh (for materialized views).

Best
Piotr





On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Fokko,
>
> We can implement Python/Rust/Go clients to interop with the serialized
> Coral IR. I am not sure it makes sense to have all front-end and back-end
> implementations (e.g., Spark to Coral IR, or Coral IR to Trino, etc.) be
> reimplemented in those languages. Such implementations actually depend on
> reusing the native parsers of those dialects, which are typically in Java
> (this is also to your point about language coverage -- reusing native
> parsers is a principle that Coral follows to stay compliant with the
> source dialect). I think making Python/Rust/Go interoperate with the IR
> (i.e., convert the serialized IR to the in-memory IR and the other way
> around) would be a good start. For example, Python-specific back-end and
> front-end implementations can follow from that.
>
> Thanks,
> Walaa.
>
>
> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> wrote:
>
>> Hey everyone,
>>
>> Views in PyIceberg are not yet as mature as in Java, mostly because
>> tooling in Python tends to work with data frames, rather than SQL. I do
>> think it would be valuable to extend support there.
>>
>> I have a bit of experience in turning SQL into ASTs and extending
>> grammars, and I'm confident in saying that it is nearly impossible to cover
>> the full grammar of a specific dialect. My main question is: what will
>> SQLGlot do when we try to translate a dialect that it doesn't fully
>> understand? Will it error out, or will it produce faulty SQL? Simple
>> examples range from functions that are not supported in other engines up to
>> recursive CTEs. In this case, not failing upfront would cause correctness
>> issues.
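>>
>> A quick way to probe this would be something along these lines (a rough
>> sketch, assuming SQLGlot's error_level/unsupported_level options; the Spark
>> query is just an arbitrary example, not a known failure case):
>>
>> import sqlglot
>> from sqlglot.errors import ErrorLevel, ParseError, UnsupportedError
>>
>> # Translate a Spark query to Trino and ask SQLGlot to raise instead of
>> # silently passing through constructs it does not understand.
>> spark_sql = "SELECT transform(xs, x -> x + 1) FROM t"
>> try:
>>     out = sqlglot.transpile(spark_sql, read="spark", write="trino",
>>                             error_level=ErrorLevel.RAISE,        # parse errors
>>                             unsupported_level=ErrorLevel.RAISE)  # unsupported constructs
>>     print(out[0])
>> except (ParseError, UnsupportedError) as err:
>>     print("refused to translate:", err)
>>
>> If it silently echoes functions it does not know instead of raising, that
>> would confirm the correctness concern.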
>>
>> Regarding Substrait: within PyIceberg there was also successful
>> experimentation with taking a DuckDB query, sending it to PyIceberg to do
>> the Iceberg query planning, and returning a physical plan to DuckDB to do
>> the actual execution. This was still at an early stage and required a lot
>> of work around credentials and field IDs, but it was quite promising. Using
>> Substrait for views looks easier to me, and would also translate to a
>> dataframe-based world. Walaa, do you have any outlook on Coral
>> Python/Rust/Go support?
>>
>> Kind regards,
>> Fokko
>>
>>
>> On Fri, 25 Oct 2024 at 22:16, Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> I think this may need some more discussion.
>>>
>>> To me, a "serialized IR" is another form of a "dialect". In this case,
>>> this dialect will be mostly specific to Iceberg, and compute engines
>>> will still support reading views in their native SQL. There are some data
>>> points on this from the Trino community in a previous discussion [1]. In
>>> addition to not being directly consumable by engines, a serialized IR will
>>> be hard for humans to consume too.
>>>
>>> From that perspective, even if Iceberg adopts some form of serialized IR,
>>> we will end up doing translation again: from that IR to the engine's
>>> dialect at view read time, and from the engine's dialect to that IR at
>>> view write time. So a serialized IR cannot eliminate translation.
>>>
>>> I think it is better not to adopt the serialized IR path too quickly,
>>> until it is proven to work and there is sufficient tooling and support
>>> around it; otherwise it will end up being a constraint.
>>>
>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some
>>> fundamental differences between their approaches, mainly around the
>>> intermediate representation abstraction. Coral models both the AST and the
>>> logical plan of a query, making it able to capture the query semantics more
>>> accurately and hence perform precise transformations. On the flip side,
>>> SQLGlot's abstraction is at the AST level only. Data type inference, for
>>> example, would be a major gap in any solution that does not capture the
>>> logical plan, yet it is very important for successful translation (for
>>> instance, Spark's "/" on two integer columns returns a double while Trino
>>> truncates integer division, so rewriting the expression correctly requires
>>> knowing the operand types). This is backed up by some experiments we
>>> performed on actual queries and their translation results (from Spark to
>>> Trino, comparing the results of Coral and SQLGlot).
>>>
>>> For the IR: Any translation solution (including Coral) must rely on an
>>> IR, and it has to be decoupled from any of the input and output dialects.
>>> This is true in the Coral case today. Such an IR is the way to represent
>>> both the intermediate AST and logical plans. Therefore, I do not think we
>>> can necessarily split projects into "IR projects" vs. not, since all
>>> solutions must use an IR. With that said, IR serialization is a matter of
>>> staging/milestones of the project. Serialized IR is next on Coral's
>>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make
>>> Iceberg interoperable with a Coral-based serialized IR. This will make the
>>> compatibility with engines that adopt Coral (like Trino) much more robust
>>> and straightforward.
>>>
>>> [1] https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>
>>>
