There was a Google design doc from back in 2019 [1] where we discussed this (or something similar at least).
I also remember reading about Weld's IR. It would be good to learn from their work. [1] https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing On Thu, Mar 18, 2021 at 9:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > I think there might be discussion on two levels of computation, physical > query execution plans, and potentially something "lower level"? When this > has come up in the past, I was a little skeptical of constraining every SDK > to use the same description, so I agree with Wes's point about keeping any > spec open in the short term. Ballista as an opt-in model, does sound like > possibly the right approach. > > I might be misunderstanding, but I think Weld [1] is another project > targeting the lower level components? > > Also, I think there was a little bit of effort to come up with a common > expression representation within C++, but got stalled on whether to use the > Gandiva expression descriptions or Flatbuffers, I can't seem to find the > thread/JIRA/discussion on this. I'll try to look some more this evening. > > [1] https://github.com/weld-project/weld > > On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <j...@jedbrown.org> wrote: > >> I'm interested in providing some path to make this extensible. To pick an >> example, suppose the user wants to compute the first k principle >> components. We've talked [1] about the possibility of incorporating richer >> communication semantics in Ballista (a la MPI sub-communicators) and >> numerical algorithms such as PCA would benefit. Those specific algorithms >> wouldn't belong in Arrow or Ballista core, but I think there's an >> opportunity for plugins to offer this sort of capability and it would be >> lovely if the language-independent protocol could call them. Do you see a >> good way to do this via ballista.proto? >> >> [1] https://github.com/ballista-compute/ballista/issues/303 >> >> Andy Grove <andygrov...@gmail.com> writes: >> >> > Hi Paddy, >> > >> > Thanks for raising this. >> > >> > Ballista defines computations using protobuf [1] to describe logical and >> > physical query plans, which consist of operators and expressions. It is >> > actually based on the Gandiva protobuf [2] for describing expressions. >> > >> > I see a lot of value in standardizing some of this across >> implementations. >> > Ballista is essentially becoming a distributed scheduler for Arrow and >> can >> > work with any implementation that supports this protobuf definition of >> > query plans. >> > >> > It would also make it easier to embed C++ in Rust, or Rust in C++, >> having >> > this common IR, so I would be all for having something like this as an >> > Arrow specification. >> > >> > Thanks, >> > >> > Andy. >> > >> > [1] >> > >> https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto >> > [2] >> > >> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto >> > >> > >> > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> >> wrote: >> > >> >> Hi All, >> >> >> >> I do not have a computer science background so I may not be asking >> this in >> >> the correct way or using the correct terminology but I wonder if we can >> >> achieve some level of standardization when describing computation over >> >> Arrow data. >> >> >> >> At the moment on the Rust side DataFusion clearly has a way to describe >> >> computation, I believe that Ballista adds the ability to serialize >> this to >> >> allow distributed computation. On the C++ side work is starting on a >> >> similar query engine and we already have Gandiva. Is there an >> opportunity >> >> to define a kind of IR for computation over Arrow data that could be >> >> adopted across implementations? >> >> >> >> In this case DataFusion could easily incorporate Gandiva to generate >> >> optimized compute kernels if they were using the same IR to describe >> >> computation. Applications built on Arrow could "describe" computation >> in >> >> any language and take advantage or innovations across the community, >> adding >> >> this to Arrow's zero copy data sharing could be a game changer in my >> mind. >> >> I'm not someone who knows enough to drive this forward but I obviously >> >> would like to get involved. For some time I was playing around with >> using >> >> TVM's relay IR [1] and applying it to Arrow data. >> >> >> >> As the Arrow memory format has now matured I fell like this could be >> the >> >> next step. Is there any plan for this kind of work or are we going to >> >> allow sub-projects to "go their own way"? >> >> >> >> Thanks, >> >> Paddy >> >> >> >> [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation ( >> apache.org)< >> >> https://tvm.apache.org/docs/dev/relay_intro.html> >> >> >> >> >> >