> I might be misunderstanding, but I think Weld [1] is another project 
> targeting the lower level components?

Weld IR is _really_ low level (I'm not an expert, but I have read the
papers); see [1] for more

> Also, I think there was a little bit of effort to come up with a common 
> expression representation within C++, but got stalled on whether to use the 
> Gandiva expression descriptions or Flatbuffer

I'm fine to say "let's just use Protobuf". Trying to unify Gandiva's
and Ballista's protobuf defs for serializing expressions and physical
execution plans seems like a reasonable place to start. If some
enterprising person wanted to enable DataFusion to JIT certain
expressions with Gandiva, that would be some good evidence of being on
the right track (sounds like a good capstone project for an advanced
database systems course).
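
To make that concrete, here is a rough sketch of the kind of unified
expression message I have in mind. This is purely illustrative: the
message and field names below are hypothetical, not taken from either
Gandiva's Types.proto or ballista.proto, and the real exercise would be
reconciling those two existing definitions.

syntax = "proto3";

package arrow.expr.sketch;

// A scalar literal value; only a few types are shown for brevity.
message Literal {
  oneof value {
    bool bool_value = 1;
    int64 int64_value = 2;
    double double_value = 3;
    string string_value = 4;
  }
}

// A reference to a column in the input schema.
message FieldReference {
  string name = 1;
}

// A named function applied to argument expressions, e.g. "add" or "equal".
message FunctionCall {
  string function_name = 1;
  repeated Expression arguments = 2;
}

// The expression tree itself.
message Expression {
  oneof expr {
    Literal literal = 1;
    FieldReference field = 2;
    FunctionCall call = 3;
  }
}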

Per Andrew's comment:

> Thus focusing initially on a standard for expressions might be a good way to 
> add value but keep the scope of the effort reasonable

Here I think we should base the exprs format on what Gandiva's done,
and if that isn't general enough for some reason it would be good to
see the analysis / reasoning for doing it differently. If everyone
wants to do it a little differently, then we might implement an
alternate serialized interface to Gandiva eventually.

I don't think we should get into SQL or SQL dialects at this moment,
at least not until we've shown we can create a standard for
lower-level query plans. SQL doesn't provide any indication of how the
query translates into an execution plan, so if the barrier to entry to
build an "engine" is to parse, analyze, optimize, and plan a query,
that is quite high. However, building an engine that knows how to
evaluate exprs and do hash aggregations and joins is a much
smaller-scope problem.
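
To illustrate that scope using the (still hypothetical) Expression
message sketched above, the relational side could be a small, closed
set of plan nodes. Again, these names are illustrative only and don't
correspond to anything that exists today in Gandiva or Ballista.

// Hypothetical plan nodes, building on the Expression message above.
message Scan {
  string table_name = 1;
  repeated string projected_columns = 2;
}

message Filter {
  PlanNode input = 1;
  Expression predicate = 2;
}

message Projection {
  PlanNode input = 1;
  repeated Expression exprs = 2;
}

message HashAggregate {
  PlanNode input = 1;
  repeated Expression group_keys = 2;
  repeated Expression aggregate_exprs = 3;
}

message HashJoin {
  PlanNode left = 1;
  PlanNode right = 2;
  repeated Expression left_keys = 3;
  repeated Expression right_keys = 4;
}

// A plan is a tree of these nodes.
message PlanNode {
  oneof node {
    Scan scan = 1;
    Filter filter = 2;
    Projection projection = 3;
    HashAggregate hash_aggregate = 4;
    HashJoin hash_join = 5;
  }
}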

Note that any "query plan" generated with this system may be somewhat
higher level than what the engine actually ends up executing. For
example, an engine might decide to push down predicates into scanners
or otherwise modify the precise order of operations.
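
As a concrete (and again hypothetical) illustration using the sketch
above, a serialized plan might put a filter node over a scan, while an
engine is free to evaluate the predicate inside the scanner:

# Serialized plan (protobuf text format, hypothetical messages above):
filter {
  input {
    scan { table_name: "events" }
  }
  predicate {
    call {
      function_name: "greater_than"
      arguments { field { name: "ts" } }
      arguments { literal { int64_value: 1600000000 } }
    }
  }
}
# The engine may push the predicate into the scan; the serialized plan
# only pins down the semantics, not the exact physical execution.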

[1]: https://github.com/weld-project/weld/blob/master/docs/language.md

On Fri, Mar 19, 2021 at 12:37 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> If we want this format to be common to different execution engines then
> it seems like it should represent logical expressions indeed (which may
> be implemented by different physical operators, depending on the
> execution engine).  But I'm no expert in the matter.
>
> Regards
>
> Antoine.
>
>
> > On 18/03/2021 at 17:22, Andrew Lamb wrote:
> > Any higher level physical execution plan most likely needs a way to
> > represent expressions. Thus focusing initially on a standard for
> > expressions might be a good way to add value but keep the scope of the
> > effort reasonable
> >
> > On Thu, Mar 18, 2021 at 11:49 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> I think there might be discussion on two levels of computation, physical
> >> query execution plans, and potentially something "lower level"?  When this
> >> has come up in the past, I was a little skeptical of constraining every SDK
> >> to use the same description, so I agree with Wes's point about keeping any
> >> spec open in the short term.  Ballista as an opt-in model does sound like
> >> possibly the right approach.
> >>
> >> I might be misunderstanding, but I think Weld [1] is another project
> >> targeting the lower level components?
> >>
> >> Also, I think there was a little bit of effort to come up with a common
> >> expression representation within C++, but got stalled on whether to use the
> >> Gandiva expression descriptions or Flatbuffers, I can't seem to find the
> >> thread/JIRA/discussion on this.  I'll try to look some more this evening.
> >>
> >> [1] https://github.com/weld-project/weld
> >>
> >> On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <j...@jedbrown.org> wrote:
> >>
> >>> I'm interested in providing some path to make this extensible. To pick an
> >>> example, suppose the user wants to compute the first k principal
> >>> components. We've talked [1] about the possibility of incorporating richer
> >>> communication semantics in Ballista (a la MPI sub-communicators) and
> >>> numerical algorithms such as PCA would benefit. Those specific algorithms
> >>> wouldn't belong in Arrow or Ballista core, but I think there's an
> >>> opportunity for plugins to offer this sort of capability and it would be
> >>> lovely if the language-independent protocol could call them. Do you see a
> >>> good way to do this via ballista.proto?
> >>>
> >>> [1] https://github.com/ballista-compute/ballista/issues/303
> >>>
> >>> Andy Grove <andygrov...@gmail.com> writes:
> >>>
> >>>> Hi Paddy,
> >>>>
> >>>> Thanks for raising this.
> >>>>
> >>>> Ballista defines computations using protobuf [1] to describe logical and
> >>>> physical query plans, which consist of operators and expressions. It is
> >>>> actually based on the Gandiva protobuf [2] for describing expressions.
> >>>>
> >>>> I see a lot of value in standardizing some of this across implementations.
> >>>> Ballista is essentially becoming a distributed scheduler for Arrow and can
> >>>> work with any implementation that supports this protobuf definition of
> >>>> query plans.
> >>>>
> >>>> It would also make it easier to embed C++ in Rust, or Rust in C++, having
> >>>> this common IR, so I would be all for having something like this as an
> >>>> Arrow specification.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Andy.
> >>>>
> >>>> [1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> >>>> [2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >>>>
> >>>>
> >>>> On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:
> >>>>
> >>>>> Hi All,
> >>>>>
> >>>>> I do not have a computer science background so I may not be asking this in
> >>>>> the correct way or using the correct terminology but I wonder if we can
> >>>>> achieve some level of standardization when describing computation over
> >>>>> Arrow data.
> >>>>>
> >>>>> At the moment on the Rust side DataFusion clearly has a way to describe
> >>>>> computation, I believe that Ballista adds the ability to serialize this to
> >>>>> allow distributed computation.  On the C++ side work is starting on a
> >>>>> similar query engine and we already have Gandiva.  Is there an opportunity
> >>>>> to define a kind of IR for computation over Arrow data that could be
> >>>>> adopted across implementations?
> >>>>>
> >>>>> In this case DataFusion could easily incorporate Gandiva to generate
> >>>>> optimized compute kernels if they were using the same IR to describe
> >>>>> computation.  Applications built on Arrow could "describe" computation in
> >>>>> any language and take advantage of innovations across the community;
> >>>>> adding this to Arrow's zero copy data sharing could be a game changer in
> >>>>> my mind.
> >>>>> I'm not someone who knows enough to drive this forward but I obviously
> >>>>> would like to get involved.  For some time I was playing around with using
> >>>>> TVM's relay IR [1] and applying it to Arrow data.
> >>>>>
> >>>>> As the Arrow memory format has now matured I feel like this could be the
> >>>>> next step.  Is there any plan for this kind of work or are we going to
> >>>>> allow sub-projects to "go their own way"?
> >>>>>
> >>>>> Thanks,
> >>>>> Paddy
> >>>>>
> >>>>> [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> >>>>> <https://tvm.apache.org/docs/dev/relay_intro.html>
> >>>>>
> >>>>>
> >>>
> >>
> >
