Hi All,

I do not have a computer science background, so I may not be asking this in the 
correct way or using the correct terminology, but I wonder if we can achieve 
some level of standardization when describing computation over Arrow data.

At the moment, on the Rust side, DataFusion clearly has a way to describe 
computation, and I believe Ballista adds the ability to serialize this to 
allow distributed computation.  On the C++ side, work is starting on a similar 
query engine and we already have Gandiva.  Is there an opportunity to define a 
kind of IR for computation over Arrow data that could be adopted across 
implementations?

In that case DataFusion could easily incorporate Gandiva to generate optimized 
compute kernels, since both would be using the same IR to describe computation.  
Applications built on Arrow could "describe" computation in any language and 
take advantage of innovations across the community; combined with Arrow's 
zero-copy data sharing, this could be a game changer in my mind.  I'm not 
someone who knows enough to drive this forward, but I obviously would like to 
get involved.  For some time I have been playing around with using TVM's Relay 
IR [1] and applying it to Arrow data.
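
To make the idea a little more concrete, here is a toy sketch in Rust of what 
a tiny, engine-neutral expression IR might look like.  The names (Expr, Column, 
Add, etc.) are made up purely for illustration; this is not DataFusion's, 
Gandiva's, or any other project's actual representation:

    // Hypothetical, minimal expression IR -- for illustration only,
    // not the actual API of DataFusion, Gandiva, or any other engine.
    #[derive(Debug, Clone)]
    enum Expr {
        Column(String),            // reference an Arrow column by name
        LiteralInt(i64),           // an integer constant
        Add(Box<Expr>, Box<Expr>), // a binary operation over sub-expressions
    }

    fn main() {
        // "a + 1" as an expression tree; the point is that any engine
        // (Rust, C++, ...) agreeing on the same IR could consume a
        // serialized form of this and generate its own optimized kernels.
        let expr = Expr::Add(
            Box::new(Expr::Column("a".to_string())),
            Box::new(Expr::LiteralInt(1)),
        );
        println!("{:?}", expr);
    }

The interesting part would of course be agreeing on the serialized form of 
such trees, so that any implementation could produce and consume them.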

As the Arrow memory format has now matured, I feel like this could be the next 
step.  Is there any plan for this kind of work, or are we going to allow 
sub-projects to "go their own way"?

Thanks,
Paddy

[1] Introduction to Relay IR - tvm 0.8.dev0 documentation: 
https://tvm.apache.org/docs/dev/relay_intro.html
