Re: [Discuss] Iceberg View Interoperability

Matt Topol Mon, 04 Nov 2024 08:40:14 -0800

For reference, there are two reasons why I chose to add that substrait.go:

1) The Go Arrow implementation has a compute package that can evaluate
Substrait expressions, as long as the required kernels exist in the package.

2) Along the lines of this conversation, I wanted to be able to generically
create Substrait expressions from Iceberg expressions. The goal is that the
Go implementation could eventually create a full Substrait plan (including
the reading) from an Iceberg table (and its metadata) and an expression.
That plan could then be sent to a compute engine, which wouldn't have to
know anything about Iceberg to execute it!
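
To make point 2 a bit more concrete, here is a minimal sketch of the idea in
Python. It is not the code from substrait.go, and every name in it (Ref, Pred,
And, Not, to_substrait) is a hypothetical stand-in rather than an iceberg-go or
PyIceberg API. The output is only loosely modeled on the JSON form of a
Substrait expression: it assumes fields are referenced by position, and it
uses a plain "function" name where a real plan would use a numeric function
anchor declared in the plan's extensions.

```python
from dataclasses import dataclass
from typing import Any, List, Union


# Hypothetical, simplified stand-ins for Iceberg expression nodes.
@dataclass(frozen=True)
class Ref:
    name: str


@dataclass(frozen=True)
class Pred:
    op: str      # "eq", "lt" or "gt" in this sketch
    term: Ref
    value: Any


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Not:
    child: "Expr"


Expr = Union[Pred, And, Not]

# Sketch-level mapping from predicate ops to comparison function names.
_COMPARISONS = {"eq": "equal", "lt": "lt", "gt": "gt"}


def _literal(value: Any) -> dict:
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"literal": {"boolean": value}}
    if isinstance(value, int):
        return {"literal": {"i64": value}}
    if isinstance(value, str):
        return {"literal": {"string": value}}
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def _field(ref: Ref, schema: List[str]) -> dict:
    # Resolve the column name to its position in the schema.
    if ref.name not in schema:
        raise KeyError(f"unknown column: {ref.name}")
    index = schema.index(ref.name)
    return {"selection": {"directReference": {"structField": {"field": index}},
                          "rootReference": {}}}


def _call(function: str, *args: dict) -> dict:
    # "function" is a simplification; see the note above about anchors.
    return {"scalarFunction": {"function": function,
                               "arguments": [{"value": a} for a in args]}}


def to_substrait(expr: Expr, schema: List[str]) -> dict:
    """Convert the toy expression tree into a Substrait-like nested dict."""
    if isinstance(expr, Pred):
        if expr.op not in _COMPARISONS:
            # Fail upfront instead of emitting a plan that may be wrong.
            raise NotImplementedError(f"unsupported predicate: {expr.op}")
        return _call(_COMPARISONS[expr.op],
                     _field(expr.term, schema), _literal(expr.value))
    if isinstance(expr, And):
        return _call("and", to_substrait(expr.left, schema),
                     to_substrait(expr.right, schema))
    if isinstance(expr, Not):
        return _call("not", to_substrait(expr.child, schema))
    raise TypeError(f"unsupported expression node: {type(expr).__name__}")
```

For example, to_substrait(And(Pred("gt", Ref("id"), 5), Not(Pred("eq",
Ref("name"), "x"))), ["id", "name"]) yields a nested "and" call whose first
argument compares field 0 with an i64 literal. An engine receiving such a tree
would not need to know that it originated from an Iceberg expression.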

On Mon, Nov 4, 2024, 5:34 PM Fokko Driesprong <fo...@apache.org> wrote:

> Matt also just added `substrait.go` to the Iceberg-Go implementation that
> I was reviewing today:
>
> https://github.com/apache/iceberg-go/pull/185/files#diff-81cfac9f2e1dcf6265c569d0a3397964f0b78e07f45bb9dcdd3effe0623aaf73
>
> That converts an Iceberg expression into a Substrait one, pretty exciting
> stuff.
>
> Kind regards,
> Fokko
>
> On Mon, 4 Nov 2024 at 14:03, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi Ajantha,
>>
>> During CommunityOverCode, I chatted with Matt Topol about Substrait and
>> ADBC.
>> I checked the Substrait support in DataFusion and it's interesting.
>>
>> I was thinking about where to actually store the Substrait plan (I was
>> thinking about an intermediate SQL representation that we could store
>> as metadata instead of storing the plan directly).
>>
>> Maybe, we could start with a proposal document to explore the
>> different options (and so follow Iceberg proposals process, creating a
>> GitHub Issue with the proposal tag, and attaching the document) ?
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Mon, Nov 4, 2024 at 10:38 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>> >
>> > Thanks everyone for the detailed discussions.
>> >
>> > Looks like we have consensus towards Substrait.
>> > Last time I checked it was not adopted by all the engines. But we can
>> work towards the adoption as well.
>> >
>> > I will explore Substrait further and come up with a design doc
>> for it.
>> >
>> > Thanks,
>> > Ajantha
>> >
>> > On Mon, Oct 28, 2024 at 11:20 PM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>> >>
>> >> Hey all,
>> >>
>> >> I'm +1 on efforts to make views more interoperable across engines, as I
>> believe such efforts would be beneficial for the wider ecosystem. I think
>> the way to do that is through higher fidelity IRs such as Substrait.
>> >>
>> >> I agree with Walaa that there's not really a valid distinction between
>> IR vs non-IR projects when it comes to translation; my understanding is
>> that in the end any translation framework would have to normalize to an
>> intermediate representation. With the SQLGlot case, it's just that the IR
>> is at the AST level and with the others they have higher fidelity to
>> capture more accurate query semantics (correct me if I'm wrong here). As of
>> today, it is already possible to use SQLGlot, translate to the desired SQL
>> and store these SQL representations. However, since it's not as high
>> fidelity as a proper IR layer, there are issues to consider like Fokko
>> mentioned; but again, if users are happy with their results, they can do
>> this today without any spec changes.
>> >>
>> >> In my opinion, the biggest hurdle for Substrait or any other IR to be
>> a viable standard in Iceberg that's worth maintaining is that there would
>> need to be consensus across different engine/language communities (e.g.
>> Walaa referenced the Trino community's perspective on such IR layers).
>> Otherwise it risks becoming something that's defined in the standard but
>> really isn't well accepted which I think we all want to avoid.
>> >>
>> >> I think as a starting point, it would be great to sync with at least
>> OSS engines/language communities and try and understand any concrete points
>> of skepticism for considering such a standard. So far, a lot of the points
>> of skepticism, as I read them, are that such a layer accounts for only
>> one engine or has such substantial feature gaps that it can't be
>> considered; but no concrete cases have been called out.
>> >> Once we establish concrete gaps, I think then it would make sense to
>> work with the respective IR community to help close those gaps or if needed
>> consider other paths.
>> >>
>> >> Thanks,
>> >> Amogh Jahagirdar
>> >>
>> >> On Mon, Oct 28, 2024 at 11:43 AM Piotr Findeisen <
>> piotr.findei...@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I have no experience with Substrait, but I agree it looks like the
>> tool for the job.
>> >>> Or, as I proposed earlier, we define our own Iceberg IR for the views.
>> >>>
>> >>> We can experiment with serialized IR being stored as a String with
>> a new dialect name, and this is how we should get this started.
>> >>> It's probably a good end solution as well, but the important value-add
>> is if we manage to converge towards one shared IR that's "native to
>> Iceberg".
>> >>> This would be best for the users -- more views would just work.
>> >>> And best for long-term evolution of the project -- standardized IR
>> would help things like incremental refreshes (for materialized views).
>> >>>
>> >>> Best
>> >>> Piotr
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>
>> >>>> Hi Fokko,
>> >>>>
>> >>>> We can implement Python/Rust/Go clients to interop with the
>> serialized Coral IR. Not sure if it makes sense to have all front-end and
>> back-end implementations (e.g., Spark to Coral IR or Coral IR to Trino,
>> etc) be reimplemented in those languages. Such implementations actually
>> depend on the reuse of the native parsers of those dialects which are
>> typically in Java (also this is to your point about the language coverage
>> -- reusing native parsers is a principle that Coral follows to be compliant
>> with the source dialect). I think making Python/Rust/Go interop/handle the
>> IR (i.e., convert the serialized IR to in-memory IR and the other way
>> around) would be a good start. For example, Python-specific backends and
>> front-end implementations can follow from that.
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa.
>> >>>>
>> >>>>
>> >>>> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>> >>>>>
>> >>>>> Hey everyone,
>> >>>>>
>> >>>>> Views in PyIceberg are not yet as mature as in Java, mostly because
>> tooling in Python tends to work with data frames, rather than SQL. I do
>> think it would be valuable to extend support there.
>> >>>>>
>> >>>>> I have a bit of experience in turning SQL into ASTs and extending
>> grammar, and I'm confident in saying that it is nearly impossible to cover all
>> the grammar of a specific dialect. My main question is, what will SQLGlot
>> do when we try to translate a dialect that it doesn't fully understand?
>> Will it error out, or will it produce faulty SQL? Simple examples range from
>> functions that are not supported in other engines to recursive CTEs. In
>> such cases, not failing upfront would cause correctness issues.
>> >>>>>
>> >>>>> Regarding Substrait: within PyIceberg there was also a successful
>> experiment in which a DuckDB query was sent to PyIceberg to do the
>> Iceberg query planning, and a physical plan was returned to DuckDB to do the
>> actual execution. This was still at an early stage and required a lot of work
>> around credentials and field-IDs, but it was quite promising. Using
>> Substrait for views looks easier to me, and would also translate to a
>> dataframe-based world. Walaa, do you have any outlook on Coral
>> Python/Rust/Go support?
>> >>>>>
>> >>>>> Kind regards,
>> >>>>> Fokko
>> >>>>>
>> >>>>>
>> >>>>> On Fri, 25 Oct 2024 at 22:16, Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> I think this may need some more discussion.
>> >>>>>>
>> >>>>>> To me, a "serialized IR" is another form of a "dialect". In this
>> case, this dialect will be mostly specific to Iceberg, and compute engines
>> will still support reading views in their native SQL. There are some data
>> points on this from the Trino community in a previous discussion [1]. In
>> addition to being not directly consumable by engines, a serialized IR will
>> be hard to consume by humans too.
>> >>>>>>
>> >>>>>> From that perspective, even if Iceberg adopts some form of a
>> serialized IR, we will end up again doing translation, from that IR to the
>> engine's dialect on view read time, and from the engine's dialect to that
>> IR on the view write time. So serialized IR cannot eliminate translation.
>> >>>>>>
>> >>>>>> I think it is better to not quickly adopt the serialized IR path
>> until it is proven to work and there is sufficient tooling and support
>> around it, else it will end up being a constraint.
>> >>>>>>
>> >>>>>> For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are
>> some fundamental differences between their approaches, mainly around the
>> intermediate representation abstraction. Coral models both the AST and the
>> logical plan of a query, making it able to capture the query semantics more
>> accurately and hence perform precise transformations. On the flip side,
>> SQLGlot's abstraction is at the AST level only. Data type inference, for
>> example, would be a major gap in any solution that does not capture the
>> logical plan, yet it is very important for successful translation. This is
>> backed up by some experiments we performed on actual queries and their
>> translation results (from Spark to Trino, comparing results of Coral and
>> SQLGlot).
>> >>>>>>
>> >>>>>> For the IR: Any translation solution (including Coral) must rely
>> on an IR, and it has to be decoupled from any of the input and output
>> dialects. This is true in the Coral case today. Such IR is the way to
>> represent both the intermediate AST and logical plans. Therefore, I do not
>> think we can necessarily split projects as "IR projects" vs not, since all
>> solutions must use an IR. With that said, IR serialization is a matter of
>> staging/milestones of the project. Serialized IR is next on Coral's
>> roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make
>> Iceberg interoperable with a Coral-based serialized IR. This will make the
>> compatibility with engines that adopt Coral (like Trino) much more robust
>> and straightforward.
>> >>>>>>
>> >>>>>> [1]
>> https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Walaa.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>>
>
