Re: [DISCUSS] How to describe computation on Arrow data?

Andy Grove Thu, 18 Mar 2021 09:20:25 -0700

There was a Google design doc from back in 2019 [1] where we discussed this
(or something similar at least).


I also remember reading about Weld's IR. It would be good to learn from
their work.

[1]
https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing

On Thu, Mar 18, 2021 at 9:49 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I think there might be discussion on two levels of computation, physical
> query execution plans, and potentially something "lower level"?  When this
> has come up in the past, I was a little skeptical of constraining every SDK
> to use the same description, so I agree with Wes's point about keeping any
> spec open in the short term.  Ballista as an opt-in model, does sound like
> possibly the right approach.
>
> I might be misunderstanding, but I think Weld [1] is another project
> targeting the lower level components?
>
> Also, I think there was a little bit of effort to come up with a common
> expression representation within C++, but got stalled on whether to use the
> Gandiva expression descriptions or Flatbuffers, I can't seem to find the
> thread/JIRA/discussion on this.  I'll try to look some more this evening.
>
> [1] https://github.com/weld-project/weld
>
> On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <j...@jedbrown.org> wrote:
>
>> I'm interested in providing some path to make this extensible. To pick an
>> example, suppose the user wants to compute the first k principle
>> components. We've talked [1] about the possibility of incorporating richer
>> communication semantics in Ballista (a la MPI sub-communicators) and
>> numerical algorithms such as PCA would benefit. Those specific algorithms
>> wouldn't belong in Arrow or Ballista core, but I think there's an
>> opportunity for plugins to offer this sort of capability and it would be
>> lovely if the language-independent protocol could call them. Do you see a
>> good way to do this via ballista.proto?
>>
>> [1] https://github.com/ballista-compute/ballista/issues/303
>>
>> Andy Grove <andygrov...@gmail.com> writes:
>>
>> > Hi Paddy,
>> >
>> > Thanks for raising this.
>> >
>> > Ballista defines computations using protobuf [1] to describe logical and
>> > physical query plans, which consist of operators and expressions. It is
>> > actually based on the Gandiva protobuf [2] for describing expressions.
>> >
>> > I see a lot of value in standardizing some of this across
>> implementations.
>> > Ballista is essentially becoming a distributed scheduler for Arrow and
>> can
>> > work with any implementation that supports this protobuf definition of
>> > query plans.
>> >
>> > It would also make it easier to embed C++ in Rust, or Rust in C++,
>> having
>> > this common IR, so I would be all for having something like this as an
>> > Arrow specification.
>> >
>> > Thanks,
>> >
>> > Andy.
>> >
>> > [1]
>> >
>> https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
>> > [2]
>> >
>> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
>> >
>> >
>> > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com>
>> wrote:
>> >
>> >> Hi All,
>> >>
>> >> I do not have a computer science background so I may not be asking
>> this in
>> >> the correct way or using the correct terminology but I wonder if we can
>> >> achieve some level of standardization when describing computation over
>> >> Arrow data.
>> >>
>> >> At the moment on the Rust side DataFusion clearly has a way to describe
>> >> computation, I believe that Ballista adds the ability to serialize
>> this to
>> >> allow distributed computation.  On the C++ side work is starting on a
>> >> similar query engine and we already have Gandiva.  Is there an
>> opportunity
>> >> to define a kind of IR for computation over Arrow data that could be
>> >> adopted across implementations?
>> >>
>> >> In this case DataFusion could easily incorporate Gandiva to generate
>> >> optimized compute kernels if they were using the same IR to describe
>> >> computation.  Applications built on Arrow could "describe" computation
>> in
>> >> any language and take advantage or innovations across the community,
>> adding
>> >> this to Arrow's zero copy data sharing could be a game changer in my
>> mind.
>> >> I'm not someone who knows enough to drive this forward but I obviously
>> >> would like to get involved.  For some time I was playing around with
>> using
>> >> TVM's relay IR [1] and applying it to Arrow data.
>> >>
>> >> As the Arrow memory format has now matured I fell like this could be
>> the
>> >> next step.  Is there any plan for this kind of work or are we going to
>> >> allow sub-projects to "go their own way"?
>> >>
>> >> Thanks,
>> >> Paddy
>> >>
>> >> [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (
>> apache.org)<
>> >> https://tvm.apache.org/docs/dev/relay_intro.html>
>> >>
>> >>
>>
>

Re: [DISCUSS] How to describe computation on Arrow data?

Reply via email to