Wes,

Thanks for this. I’ve added comments to the doc and to the PR.

The biggest surprise is that this language does full relational operations. I was expecting it to handle fragments of those operations. Consider join. A distributed hybrid hash join needs to partition rows into output buffers based on a hash key, build hash tables, probe into hash tables, scan the hash tables for untouched “outer” rows, and so forth. I see Arrow’s compute as delivering each of those operations, working on perhaps a batch at a time, or a sequence of batches, with some other system coordinating those tasks. So I would expect to see Arrow’s compute language mainly operating on batches rather than on a table abstraction.
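To make that concrete, the following is a rough sketch of the granularity I have in mind -- plain Python over pyarrow record batches. The helper names, the key-column position, and the modulo hash scheme are made up for illustration, and the per-row loops stand in for what would really be vectorized kernels:

    import pyarrow as pa
    import pyarrow.compute as pc

    def partition(batch, n_partitions, key=0):
        # Route each row of a batch into an output buffer by hash of its key.
        buckets = [[] for _ in range(n_partitions)]
        for row, k in enumerate(batch.column(key).to_pylist()):
            buckets[hash(k) % n_partitions].append(row)
        return [pc.take(batch, pa.array(rows, type=pa.int64()))
                for rows in buckets]

    def build(batches, key=0):
        # Build side: map key -> list of (batch index, row index).
        table = {}
        for b, batch in enumerate(batches):
            for row, k in enumerate(batch.column(key).to_pylist()):
                table.setdefault(k, []).append((b, row))
        return table

    def probe(batch, table, touched, key=0):
        # Probe side: emit (build position, probe row) pairs for matches and
        # remember which build keys were touched.
        matches = []
        for row, k in enumerate(batch.column(key).to_pylist()):
            for pos in table.get(k, []):
                matches.append((pos, row))
                touched.add(k)
        return matches

    def unmatched_build_rows(table, touched):
        # After all probe batches: scan for build rows whose keys were never
        # touched, i.e. the "outer" rows.
        return [pos for k, rows in table.items()
                if k not in touched for pos in rows]

Each of those steps consumes and produces batches; the coordinating system would decide which worker runs which step and where the partitioned buffers get shipped.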
Julian

> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi folks,
>
> This idea came up in passing in the past -- given that there are multiple independent efforts to develop Arrow-native query engines (and surely many more to come), it seems like it would be valuable to have a way to enable user languages (like Java, Python, R, or Rust, for example) to communicate with backend computing engines (like DataFusion, or new computing capabilities being built in the Arrow C++ library) in a fashion that is "lower-level" than SQL and specialized to Arrow's type system. So rather than leaving it to a SQL parser / analyzer framework to generate an expression tree of relational operators and then translate that to an Arrow-native compute engine's internal grammar, a user framework could provide the desired Arrow-native expression tree / data manipulations directly and skip the SQL altogether.
>
> The idea of creating a "serialized intermediate representation (IR)" for Arrow compute operations would be to serve use cases large and small -- from the most complex TPC-* or time series database query to the most simple array predicate/filter sent with an RPC request using Arrow Flight. It is deliberately language- and engine-agnostic, with the only commonality across usages being the Arrow columnar format (schemas and array types). This would be better than leaving it to each application to develop its own bespoke expression representations for its needs.
>
> I spent a while thinking about this and wrote up a brain dump RFC document [1] and accompanying pull request [2] that makes the possibly controversial choice of using Flatbuffers to represent the serialized IR. I discuss the rationale for the choice of Flatbuffers in the RFC document. This PR is obviously deficient in many regards (incomplete, hacky, or unclear in places), and will need some help from others to flesh out. I suspect that actually implementing the IR will be necessary to work out many of the low-level details.
>
> Note that this IR is intended to be more of a "superset" project than a "lowest common denominator". So there may be things that it includes which are only available in some engines (e.g. engines that have specialized handling of time series data).
>
> As some of my personal motivation for the project, concurrent with the genesis of Apache Arrow, I started a Python project called Ibis [3] (which is similar to R's dplyr project) which serves as a "Python analytical query IR builder" that is capable of generating most of the SQL standard, targeting many different SQL dialects and other backends (like pandas). Microsoft ([4]) and Google ([5]) have used this library as a "many-SQL" middleware.
> As such, I would like to be able to translate from the in-memory "logical query" data structures in a library like Ibis to a serialized format that can be executed by many different Arrow-native query engines. The expression primitives available in Ibis should serve as helpful test cases, too.
>
> I look forward to the community's comments on the RFC document [1] and pull request [2] -- I realize that arriving at consensus on a complex and ambitious project like this can be challenging, so I recommend spending time on the "non-goals" section in the RFC and asking questions if you are unclear about the scope of what problems this is trying to solve. I would be happy to give Edit access on the RFC document to others and would consider ideas about how to move forward with something that is able to be implemented by different Arrow libraries in the reasonably near future.
>
> Thanks,
> Wes
>
> [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> [2]: https://github.com/apache/arrow/pull/10856
> [3]: https://ibis-project.org/
> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
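For concreteness, the kind of in-memory "logical query" that Ibis builds -- and that the proposed IR would serialize for an Arrow-native engine instead of compiling to a SQL dialect -- looks roughly like the sketch below. The table name, column names, and types are hypothetical; the calls shown (ibis.table, filter, group_by, aggregate) are part of Ibis's expression-building interface:

    import ibis

    # An unbound table: schema only, not attached to any backend.
    sales = ibis.table(
        [("region", "string"), ("amount", "double")],
        name="sales",
    )

    # A small logical query: filter, then a grouped aggregate. Ibis holds
    # this as an expression tree; today it compiles such trees to SQL
    # dialects, and the proposed IR would be another serialization target
    # for the same kind of tree.
    expr = (
        sales.filter(sales.amount > 0)
        .group_by("region")
        .aggregate([sales.amount.sum().name("total_amount")])
    )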