Re: [Discuss] Iceberg View Interoperability

Ajantha Bhat Mon, 04 Nov 2024 01:38:45 -0800

Thanks everyone for the detailed discussions.

Looks like we have consensus towards Substrait.
Last time I checked it was not adopted by all the engines. But we can work
towards the adoption as well.


I will explore further on Substrait and come up with the design doc on the
same.

Thanks,
Ajantha

On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey all,
>
> I'm +1 in efforts to make views more interoperable across engines as I
> believe such efforts would be beneficial for the wider ecosystem. I think
> the way to do that is through higher fidelity IRs such as Substrait.
>
> I agree with Walaa that there's not really a valid distinction between IR
> vs non-IR projects when it comes to translation; my understanding is that
> in the end any translation framework would have to normalize to an
> intermediate representation. With the SQLGlot case, it's just that the IR
> is at the AST level and with the others they have higher fidelity to
> capture more accurate query semantics (correct me if I'm wrong here). As of
> today, it is already possible to use SQLGlot, translate to the desired SQL
> and store these SQL representations. However, since it's not as high
> fidelity as a proper IR layer, there are issues to consider like Fokko
> mentioned; but again, if users are happy with their results, they can do
> this today without any spec changes.
>
> In my opinion, the biggest hurdle for Substrait or any other IR to be a
> viable standard in Iceberg that's worth maintaining is that there would
> need to be consensus across different engine/language communities (e.g.
> Walaa referenced the Trino community's perspective on such IR layers).
> Otherwise it risks becoming something that's defined in the standard but
> really isn't well accepted which I think we all want to avoid.
>
> I think as a starting point, it would be great to sync with at least OSS
> engines/language communities and try and understand any concrete points of
> skepticism for considering such a standard. So far a lot of the points of
> skepticism as I read it are around such a layer being only considerate of 1
> engine or having such substantial feature gaps that it can't be considered;
> but no concrete cases have been called out.
> Once we establish concrete gaps, I think then it would make sense to work
> with the respective IR community to help close those gaps or if needed
> consider other paths.
>
> Thanks,
> Amogh Jahagirdar
>
> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <
> piotr.findei...@gmail.com> wrote:
>
>> Hi,
>>
>> I have no experience with Substrait, but i agree it looks like the tool
>> for the job.
>> Or, as I proposed earlier, we define our own Iceberg IR for the views.
>>
>> We can experiment with serialized IR being stored as a String with new
>> dialect name, and this is how we should get this started.
>> It's probably good end solution as well, but the important value-add is
>> if we manage to converge towards one shared IR that's "native to iceberg".
>> This would be best for the users -- more views would just work.
>> And best for long-term evolution of the project -- standardized IR would
>> help things like incremental refreshes (for materialized views).
>>
>> Best
>> Piotr
>>
>>
>>
>>
>>
>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <wa.moust...@gmail.com>
>> wrote:
>>
>>> Hi Fokko,
>>>
>>> We can implement Python/Rust/Go clients to interop with the serialized
>>> Coral IR. Not sure if it makes sense to have all front-end and back-end
>>> implementations (e.g., Spark to Coral IR or Coral IR to Trino, etc) be
>>> reimplemented in those languages. Such implementations actually depend on
>>> the reuse of the native parsers of those dialects which are typically in
>>> Java (also this is to your point about the language coverage --
>>> reusing native parsers is a principle that Coral follows to be compliant
>>> with the source dialect). I think making Python/Rust/Go interop/handle the
>>> IR (i.e., convert the serialized IR to in-memory IR and the other way
>>> around) would be a good start. For example, Python-specific backends and
>>> front-end implementations can follow from that.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org>
>>> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> Views in PyIceberg are not yet as mature as in Java, mostly because
>>>> tooling in Python tends to work with data frames, rather than SQL. I do
>>>> think it would be valuable to extend support there.
>>>>
>>>> I have a bit of experience in turning SQL into ASTs and extending
>>>> grammar, and I'm confident to say that it is nearly impossible to cover all
>>>> the grammar of a specific dialect. My main question is, what will SQLGlot
>>>> do when we try to translate a dialect that it doesn't fully understand?
>>>> Will it error out, or will it produce faulty SQL? A simple example can be
>>>> functions that are not supported in other engines up to recursive CTE's. In
>>>> this case, not failing upfront would cause correctness issues.
>>>>
>>>> Regarding Substrait. Within PyIceberg there was also successful
>>>> experimentation of having a DuckDB query, sending it to PyIceberg to do the
>>>> Iceberg query planning, and returning a physical plan to DuckDB to do the
>>>> actual execution. This was still an early stage and required a lot of work
>>>> around credentials and field-IDs, but it was quite promising. Using
>>>> Substrait as views looks easier to me, and would also translate to a
>>>> dataframe-based world. Walaa, do you have any outlook on Coral
>>>> Python/Rust/Go support?
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>>
>>>> Op vr 25 okt 2024 om 22:16 schreef Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com>:
>>>>
>>>>> I think this may need some more discussion.
>>>>>
>>>>> To me, a "serialized IR" is another form of a "dialect". In this case,
>>>>> this dialect will be mostly specific to Iceberg, and compute engines
>>>>> will still support reading views in their native SQL. There are some data
>>>>> points on this from the Trino community in a previous discussion [1]. In
>>>>> addition to being not directly consumable by engines, a serialized IR will
>>>>> be hard to consume by humans too.
>>>>>
>>>>> From that perspective, even if Iceberg adopts some form of a
>>>>> serialized IR, we will end up again doing translation, from that IR to the
>>>>> engine's dialect on view read time, and from the engine's dialect to that
>>>>> IR on the view write time. So serialized IR cannot eliminate translation.
>>>>>
>>>>> I think it is better to not quickly adopt the serialized IR path until
>>>>> it is proven to work and there is sufficient tooling and support around 
>>>>> it,
>>>>> else it will end up being a constraint.
>>>>>
>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some
>>>>> fundamental differences between their approaches, mainly around the
>>>>> intermediate representation abstraction. Coral models both the AST and the
>>>>> logical plan of a query, making it able to capture the query semantics 
>>>>> more
>>>>> accurately and hence perform precise transformations. On the flip side,
>>>>> SQLGlot abstraction is at the AST level only. Data type inference would be
>>>>> a major gap in any solution that does not capture the logical plan for
>>>>> example, yet very important to perform successful translation. This is
>>>>> backed up by some experiments we performed on actual queries and their
>>>>> translation results (from Spark to Trino, comparing results of Coral and
>>>>> SQLGlot).
>>>>>
>>>>> For the IR: Any translation solution (including Coral) must rely on an
>>>>> IR, and it has to be decoupled from any of the input and output dialects.
>>>>> This is true in the Coral case today. Such IR is the way to represent both
>>>>> the intermediate AST and logical plans. Therefore, I do not think we can
>>>>> necessarily split projects as "IR projects" vs not, since all solutions
>>>>> must use an IR. With that said, IR serialization is a matter of
>>>>> staging/milestones of the project. Serialized IR is next on Coral's
>>>>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to 
>>>>> make
>>>>> Iceberg interoperable with a Coral-based serialized IR. This will make the
>>>>> compatibility with engines that adopt Coral (like Trino) much more robust
>>>>> and straightforward.
>>>>>
>>>>> [1]
>>>>> https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>>
>>>>>

Re: [Discuss] Iceberg View Interoperability

Reply via email to