I completely agree with developing a common “query protocol” or “physical execution plan” IR + serialization scheme inside Apache Arrow. It may take some time to stabilize, so we should avoid being hasty about closing it to changes until more time has elapsed and requirements have had a chance to percolate.
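To make the shape of such an IR a bit more concrete, a serialized plan description might look roughly like the sketch below. The package, message, and field names here are illustrative placeholders only, not the actual Ballista or Gandiva definitions referenced later in this thread:

syntax = "proto3";

package arrow.plan.sketch;  // hypothetical package name

// A logical plan is a tree of relational operators.
message LogicalPlanNode {
  oneof plan {
    ScanNode scan = 1;
    ProjectionNode projection = 2;
    FilterNode filter = 3;
  }
}

message ScanNode {
  string table_name = 1;
  repeated string projected_columns = 2;
}

message ProjectionNode {
  LogicalPlanNode input = 1;
  repeated Expression exprs = 2;
}

message FilterNode {
  LogicalPlanNode input = 1;
  Expression predicate = 2;
}

// Expressions cover column references, literals, and function calls,
// playing the role Gandiva's expression protobuf plays today.
message Expression {
  oneof expr {
    string column = 1;
    Literal literal = 2;
    FunctionCall call = 3;
  }
}

message Literal {
  oneof value {
    int64 int64_value = 1;
    double double_value = 2;
    string utf8_value = 3;
  }
}

message FunctionCall {
  string name = 1;            // e.g. "add", "greater_than"
  repeated Expression args = 2;
}

Any engine (DataFusion, the new C++ query engine, Gandiva, Ballista's scheduler) could then consume or produce a single wire format regardless of the language it is implemented in.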
On Thu, Mar 18, 2021 at 8:17 AM Andy Grove <andygrov...@gmail.com> wrote:

> Hi Paddy,
>
> Thanks for raising this.
>
> Ballista defines computations using protobuf [1] to describe logical and
> physical query plans, which consist of operators and expressions. It is
> actually based on the Gandiva protobuf [2] for describing expressions.
>
> I see a lot of value in standardizing some of this across implementations.
> Ballista is essentially becoming a distributed scheduler for Arrow and can
> work with any implementation that supports this protobuf definition of
> query plans.
>
> It would also make it easier to embed C++ in Rust, or Rust in C++, having
> this common IR, so I would be all for having something like this as an
> Arrow specification.
>
> Thanks,
>
> Andy.
>
> [1]
> https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> [2]
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
>
> On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:
>
> > Hi All,
> >
> > I do not have a computer science background, so I may not be asking this
> > in the correct way or using the correct terminology, but I wonder if we
> > can achieve some level of standardization when describing computation
> > over Arrow data.
> >
> > At the moment, on the Rust side, DataFusion clearly has a way to describe
> > computation, and I believe that Ballista adds the ability to serialize
> > this to allow distributed computation. On the C++ side, work is starting
> > on a similar query engine, and we already have Gandiva. Is there an
> > opportunity to define a kind of IR for computation over Arrow data that
> > could be adopted across implementations?
> >
> > In this case, DataFusion could easily incorporate Gandiva to generate
> > optimized compute kernels if they were using the same IR to describe
> > computation. Applications built on Arrow could "describe" computation in
> > any language and take advantage of innovations across the community;
> > adding this to Arrow's zero-copy data sharing could be a game changer in
> > my mind. I'm not someone who knows enough to drive this forward, but I
> > obviously would like to get involved. For some time I was playing around
> > with using TVM's Relay IR [1] and applying it to Arrow data.
> >
> > As the Arrow memory format has now matured, I feel like this could be
> > the next step. Is there any plan for this kind of work, or are we going
> > to allow sub-projects to "go their own way"?
> >
> > Thanks,
> > Paddy
> >
> > [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> > <https://tvm.apache.org/docs/dev/relay_intro.html>
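To Paddy's point about describing computation in any language: with a sketch like the one above, a query such as SELECT a FROM t WHERE b > 5 would serialize (shown here in protobuf text format, again purely illustrative) to something like:

# Hypothetical instance of the sketch above, not Ballista's actual format
projection {
  input {
    filter {
      input {
        scan { table_name: "t" projected_columns: "a" projected_columns: "b" }
      }
      predicate {
        call {
          name: "greater_than"
          args { column: "b" }
          args { literal { int64_value: 5 } }
        }
      }
    }
  }
  exprs { column: "a" }
}

Any implementation that can parse such a message, whether in Rust, C++, or elsewhere, could execute or delegate the plan without re-describing the computation in its own terms.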