Wes,

Thanks for this. I’ve added comments to the doc and to the PR.

The biggest surprise is that this language does full relational operations. I was expecting it to handle fragments of those operations. Consider join. A distributed hybrid hash join needs to partition rows into output buffers based on a hash key, build hash tables, probe into hash tables, scan the hash tables for untouched “outer” rows, and so forth. I see Arrow’s compute as delivering each of those operations, working on perhaps a batch at a time, or a sequence of batches, with some other system coordinating those tasks. So I would expect to see Arrow’s compute language mainly operating on batches rather than on a table abstraction.
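To make that concrete, the following is a rough sketch of the granularity I have in mind -- plain Python over pyarrow record batches. The helper names, the key-column position, and the modulo hash scheme are made up for illustration, and the per-row loops stand in for what would really be vectorized kernels:

    import pyarrow as pa
    import pyarrow.compute as pc

    def partition(batch, n_partitions, key=0):
        # Route each row of a batch into an output buffer by hash of its key.
        buckets = [[] for _ in range(n_partitions)]
        for row, k in enumerate(batch.column(key).to_pylist()):
            buckets[hash(k) % n_partitions].append(row)
        return [pc.take(batch, pa.array(rows, type=pa.int64()))
                for rows in buckets]

    def build(batches, key=0):
        # Build side: map key -> list of (batch index, row index).
        table = {}
        for b, batch in enumerate(batches):
            for row, k in enumerate(batch.column(key).to_pylist()):
                table.setdefault(k, []).append((b, row))
        return table

    def probe(batch, table, touched, key=0):
        # Probe side: emit (build position, probe row) pairs for matches and
        # remember which build keys were touched.
        matches = []
        for row, k in enumerate(batch.column(key).to_pylist()):
            for pos in table.get(k, []):
                matches.append((pos, row))
                touched.add(k)
        return matches

    def unmatched_build_rows(table, touched):
        # After all probe batches: scan for build rows whose keys were never
        # touched, i.e. the "outer" rows.
        return [pos for k, rows in table.items()
                if k not in touched for pos in rows]

Each of those steps consumes and produces batches; the coordinating system would decide which worker runs which step and where the partitioned buffers get shipped.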
Julian

> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi folks,
>
> This idea came up in passing in the past -- given that there are multiple independent efforts to develop Arrow-native query engines (and surely many more to come), it seems like it would be valuable to have a way to enable user languages (like Java, Python, R, or Rust, for example) to communicate with backend computing engines (like DataFusion, or new computing capabilities being built in the Arrow C++ library) in a fashion that is "lower-level" than SQL and specialized to Arrow's type system. So rather than leaving it to a SQL parser / analyzer framework to generate an expression tree of relational operators and then translate that to an Arrow-native compute engine's internal grammar, a user framework could provide the desired Arrow-native expression tree / data manipulations directly and skip the SQL altogether.
>
> The idea of creating a "serialized intermediate representation (IR)" for Arrow compute operations would be to serve use cases large and small -- from the most complex TPC-* or time series database query to the most simple array predicate/filter sent with an RPC request using Arrow Flight. It is deliberately language- and engine-agnostic, with the only commonality across usages being the Arrow columnar format (schemas and array types). This would be better than leaving it to each application to develop its own bespoke expression representations for its needs.
>
> I spent a while thinking about this and wrote up a brain dump RFC document [1] and accompanying pull request [2] that makes the possibly controversial choice of using Flatbuffers to represent the serialized IR. I discuss the rationale for the choice of Flatbuffers in the RFC document. This PR is obviously deficient in many regards (incomplete, hacky, or unclear in places), and will need some help from others to flesh out. I suspect that actually implementing the IR will be necessary to work out many of the low-level details.
>
> Note that this IR is intended to be more of a "superset" project than a "lowest common denominator". So there may be things that it includes which are only available in some engines (e.g. engines that have specialized handling of time series data).
>
> As some of my personal motivation for the project, concurrent with the genesis of Apache Arrow, I started a Python project called Ibis [3] (which is similar to R's dplyr project) which serves as a "Python analytical query IR builder" that is capable of generating most of the SQL standard, targeting many different SQL dialects and other backends (like pandas). Microsoft ([4]) and Google ([5]) have used this library as a "many-SQL" middleware.
> As such, I would like to be able to translate from the in-memory "logical query" data structures in a library like Ibis to a serialized format that can be executed by many different Arrow-native query engines. The expression primitives available in Ibis should serve as helpful test cases, too.
>
> I look forward to the community's comments on the RFC document [1] and pull request [2] -- I realize that arriving at consensus on a complex and ambitious project like this can be challenging, so I recommend spending time on the "non-goals" section in the RFC and asking questions if you are unclear about the scope of what problems this is trying to solve. I would be happy to give Edit access on the RFC document to others and would consider ideas about how to move forward with something that is able to be implemented by different Arrow libraries in the reasonably near future.
>
> Thanks,
> Wes
>
> [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> [2]: https://github.com/apache/arrow/pull/10856
> [3]: https://ibis-project.org/
> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
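For concreteness, the kind of in-memory "logical query" that Ibis builds -- and that the proposed IR would serialize for an Arrow-native engine instead of compiling to a SQL dialect -- looks roughly like the sketch below. The table name, column names, and types are hypothetical; the calls shown (ibis.table, filter, group_by, aggregate) are part of Ibis's expression-building interface:

    import ibis

    # An unbound table: schema only, not attached to any backend.
    sales = ibis.table(
        [("region", "string"), ("amount", "double")],
        name="sales",
    )

    # A small logical query: filter, then a grouped aggregate. Ibis holds
    # this as an expression tree; today it compiles such trees to SQL
    # dialects, and the proposed IR would be another serialization target
    # for the same kind of tree.
    expr = (
        sales.filter(sales.amount > 0)
        .group_by("region")
        .aggregate([sales.amount.sum().name("total_amount")])
    )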