Re: [Discuss] Iceberg View Interoperability

Jean-Baptiste Onofré Mon, 04 Nov 2024 05:04:28 -0800

Hi Ajantha,

During CommunityOverCode, I chatted with Matt Topol about Substrait and ADBC.
I checked the Substrait support in DataFusion and it's interesting.


I was thinking about where to actually store the Substrait plan (I was
thinking about an intermediate SQL representation that we could store
as a metadata instead of directly the plan).

Maybe, we could start with a proposal document to explore the
different options (and so follow Iceberg proposals process, creating a
GitHub Issue with the proposal tag, and attaching the document) ?

Thanks !
Regards
JB

On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
> Thanks everyone for the detailed discussions.
>
> Looks like we have consensus towards Substrait.
> Last time I checked it was not adopted by all the engines. But we can work 
> towards the adoption as well.
>
> I will explore further on Substrait and come up with the design doc on the 
> same.
>
> Thanks,
> Ajantha
>
> On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>
>> Hey all,
>>
>> I'm +1 in efforts to make views more interoperable across engines as I 
>> believe such efforts would be beneficial for the wider ecosystem. I think 
>> the way to do that is through higher fidelity IRs such as Substrait.
>>
>> I agree with Walaa that there's not really a valid distinction between IR vs 
>> non-IR projects when it comes to translation; my understanding is that in 
>> the end any translation framework would have to normalize to an intermediate 
>> representation. With the SQLGlot case, it's just that the IR is at the AST 
>> level and with the others they have higher fidelity to capture more accurate 
>> query semantics (correct me if I'm wrong here). As of today, it is already 
>> possible to use SQLGlot, translate to the desired SQL and store these SQL 
>> representations. However, since it's not as high fidelity as a proper IR 
>> layer, there are issues to consider like Fokko mentioned; but again, if 
>> users are happy with their results, they can do this today without any spec 
>> changes.
>>
>> In my opinion, the biggest hurdle for Substrait or any other IR to be a 
>> viable standard in Iceberg that's worth maintaining is that there would need 
>> to be consensus across different engine/language communities (e.g. Walaa 
>> referenced the Trino community's perspective on such IR layers). Otherwise 
>> it risks becoming something that's defined in the standard but really isn't 
>> well accepted which I think we all want to avoid.
>>
>> I think as a starting point, it would be great to sync with at least OSS 
>> engines/language communities and try and understand any concrete points of 
>> skepticism for considering such a standard. So far a lot of the points of 
>> skepticism as I read it are around such a layer being only considerate of 1 
>> engine or having such substantial feature gaps that it can't be considered; 
>> but no concrete cases have been called out.
>> Once we establish concrete gaps, I think then it would make sense to work 
>> with the respective IR community to help close those gaps or if needed 
>> consider other paths.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <piotr.findei...@gmail.com> 
>> wrote:
>>>
>>> Hi,
>>>
>>> I have no experience with Substrait, but i agree it looks like the tool for 
>>> the job.
>>> Or, as I proposed earlier, we define our own Iceberg IR for the views.
>>>
>>> We can experiment with serialized IR being stored as a String with new 
>>> dialect name, and this is how we should get this started.
>>> It's probably good end solution as well, but the important value-add is if 
>>> we manage to converge towards one shared IR that's "native to iceberg".
>>> This would be best for the users -- more views would just work.
>>> And best for long-term evolution of the project -- standardized IR would 
>>> help things like incremental refreshes (for materialized views).
>>>
>>> Best
>>> Piotr
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <wa.moust...@gmail.com> 
>>> wrote:
>>>>
>>>> Hi Fokko,
>>>>
>>>> We can implement Python/Rust/Go clients to interop with the serialized 
>>>> Coral IR. Not sure if it makes sense to have all front-end and back-end 
>>>> implementations (e.g., Spark to Coral IR or Coral IR to Trino, etc) be 
>>>> reimplemented in those languages. Such implementations actually depend on 
>>>> the reuse of the native parsers of those dialects which are typically in 
>>>> Java (also this is to your point about the language coverage -- reusing 
>>>> native parsers is a principle that Coral follows to be compliant with the 
>>>> source dialect). I think making Python/Rust/Go interop/handle the IR 
>>>> (i.e., convert the serialized IR to in-memory IR and the other way around) 
>>>> would be a good start. For example, Python-specific backends and front-end 
>>>> implementations can follow from that.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> Views in PyIceberg are not yet as mature as in Java, mostly because 
>>>>> tooling in Python tends to work with data frames, rather than SQL. I do 
>>>>> think it would be valuable to extend support there.
>>>>>
>>>>> I have a bit of experience in turning SQL into ASTs and extending 
>>>>> grammar, and I'm confident to say that it is nearly impossible to cover 
>>>>> all the grammar of a specific dialect. My main question is, what will 
>>>>> SQLGlot do when we try to translate a dialect that it doesn't fully 
>>>>> understand? Will it error out, or will it produce faulty SQL? A simple 
>>>>> example can be functions that are not supported in other engines up to 
>>>>> recursive CTE's. In this case, not failing upfront would cause 
>>>>> correctness issues.
>>>>>
>>>>> Regarding Substrait. Within PyIceberg there was also successful 
>>>>> experimentation of having a DuckDB query, sending it to PyIceberg to do 
>>>>> the Iceberg query planning, and returning a physical plan to DuckDB to do 
>>>>> the actual execution. This was still an early stage and required a lot of 
>>>>> work around credentials and field-IDs, but it was quite promising. Using 
>>>>> Substrait as views looks easier to me, and would also translate to a 
>>>>> dataframe-based world. Walaa, do you have any outlook on Coral 
>>>>> Python/Rust/Go support?
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>>
>>>>> Op vr 25 okt 2024 om 22:16 schreef Walaa Eldin Moustafa 
>>>>> <wa.moust...@gmail.com>:
>>>>>>
>>>>>> I think this may need some more discussion.
>>>>>>
>>>>>> To me, a "serialized IR" is another form of a "dialect". In this case, 
>>>>>> this dialect will be mostly specific to Iceberg, and compute engines 
>>>>>> will still support reading views in their native SQL. There are some 
>>>>>> data points on this from the Trino community in a previous discussion 
>>>>>> [1]. In addition to being not directly consumable by engines, a 
>>>>>> serialized IR will be hard to consume by humans too.
>>>>>>
>>>>>> From that perspective, even if Iceberg adopts some form of a serialized 
>>>>>> IR, we will end up again doing translation, from that IR to the engine's 
>>>>>> dialect on view read time, and from the engine's dialect to that IR on 
>>>>>> the view write time. So serialized IR cannot eliminate translation.
>>>>>>
>>>>>> I think it is better to not quickly adopt the serialized IR path until 
>>>>>> it is proven to work and there is sufficient tooling and support around 
>>>>>> it, else it will end up being a constraint.
>>>>>>
>>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some 
>>>>>> fundamental differences between their approaches, mainly around the 
>>>>>> intermediate representation abstraction. Coral models both the AST and 
>>>>>> the logical plan of a query, making it able to capture the query 
>>>>>> semantics more accurately and hence perform precise transformations. On 
>>>>>> the flip side, SQLGlot abstraction is at the AST level only. Data type 
>>>>>> inference would be a major gap in any solution that does not capture the 
>>>>>> logical plan for example, yet very important to perform successful 
>>>>>> translation. This is backed up by some experiments we performed on 
>>>>>> actual queries and their translation results (from Spark to Trino, 
>>>>>> comparing results of Coral and SQLGlot).
>>>>>>
>>>>>> For the IR: Any translation solution (including Coral) must rely on an 
>>>>>> IR, and it has to be decoupled from any of the input and output 
>>>>>> dialects. This is true in the Coral case today. Such IR is the way to 
>>>>>> represent both the intermediate AST and logical plans. Therefore, I do 
>>>>>> not think we can necessarily split projects as "IR projects" vs not, 
>>>>>> since all solutions must use an IR. With that said, IR serialization is 
>>>>>> a matter of staging/milestones of the project. Serialized IR is next on 
>>>>>> Coral's roadmap. If Iceberg ends up adopting an IR, it might be a good 
>>>>>> idea to make Iceberg interoperable with a Coral-based serialized IR. 
>>>>>> This will make the compatibility with engines that adopt Coral (like 
>>>>>> Trino) much more robust and straightforward.
>>>>>>
>>>>>> [1] https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>>

Re: [Discuss] Iceberg View Interoperability

Reply via email to