Hi folks,

This idea came up in passing in the past -- given that there are
multiple independent efforts to develop Arrow-native query engines
(and surely many more to come), it seems like it would be valuable to
have a way to enable user languages (like Java, Python, R, or Rust,
for example) to communicate with backend computing engines (like
DataFusion, or new computing capabilities being built in the Arrow C++
library) in a fashion that is "lower-level" than SQL and specialized
to Arrow's type system. So rather than leaving it to a SQL parser /
analyzer framework to generate an expression tree of relational
operators and then translate that to an Arrow-native compute-engine's
internal grammar, a user framework could provide the desired
Arrow-native expression tree / data manipulations directly and skip
the SQL altogether.
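To make the idea concrete, here is a small hypothetical sketch (not the proposed IR format, and not any engine's actual API) of the kind of Arrow-native expression tree a user framework might construct directly, rather than generating SQL text for an engine to parse and re-analyze:

```python
# Hypothetical sketch only: class and function names here are invented
# for illustration, not taken from the RFC or any Arrow library.
from dataclasses import dataclass


@dataclass
class Field:
    name: str          # reference to a column in the input schema


@dataclass
class Literal:
    value: object      # a constant scalar value


@dataclass
class Call:
    function: str      # a compute-function name, e.g. "greater"
    args: tuple


@dataclass
class Filter:
    predicate: Call    # boolean-valued expression tree
    source: str        # name of the input table


# Roughly equivalent to "SELECT * FROM t WHERE x > 10", but expressed
# as a tree of relational/scalar operators instead of SQL text.
expr = Filter(predicate=Call("greater", (Field("x"), Literal(10))),
              source="t")
print(expr.predicate.function)  # greater
```

A frontend like Ibis would build trees of this shape from user code; a backend engine would walk the tree and execute it against Arrow data, with no SQL parser or analyzer in between.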

The idea of creating a "serialized intermediate representation (IR)"
for Arrow compute operations would be to serve use cases large and
small -- from the most complex TPC-* or time series database query to
the most simple array predicate/filter sent with an RPC request using
Arrow Flight. It is deliberately language- and engine-agnostic, with
the only commonality across usages being the Arrow columnar format
(schemas and array types). This would be better than leaving it to
each application to develop its own bespoke expression representations
for its needs.

I spent a while thinking about this and wrote up a brain dump RFC
document [1] and accompanying pull request [2] that makes the possibly
controversial choice of using Flatbuffers to represent the serialized
IR. I discuss the rationale for the choice of Flatbuffers in the RFC
document. This PR is obviously deficient in many regards (incomplete,
hacky, or unclear in places), and will need some help from others to
flesh out. I suspect that actually implementing the IR will be
necessary to work out many of the low-level details.
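The serialization side of this can be sketched as a simple round trip: a producer library serializes an expression tree into bytes, and a consumer engine deserializes and interprets it. The RFC proposes Flatbuffers for this; the sketch below uses JSON and invented field names purely to illustrate the producer/consumer flow, not the actual wire format:

```python
# Illustrative only: the RFC proposes Flatbuffers for the serialized IR.
# JSON and the field names below are stand-ins to show the round trip.
import json

expression = {
    "op": "filter",
    "predicate": {
        "call": "greater",
        "args": [{"field": "x"}, {"literal": 10}],
    },
}

# Producer side (e.g. a query-IR builder in Python or R):
payload = json.dumps(expression).encode("utf-8")

# Consumer side (e.g. an Arrow-native engine receiving the payload,
# perhaps attached to an Arrow Flight RPC request):
decoded = json.loads(payload)
assert decoded == expression
```

Flatbuffers would give the same round trip without a parse step on the consumer side, which is part of the rationale discussed in the RFC.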

Note that this IR is intended to be more of a "superset" project than
a "lowest common denominator". So there may be things that it includes
which are only available in some engines (e.g. engines that have
specialized handling of time series data).

As some of my personal motivation for the project: concurrent with the
genesis of Apache Arrow, I started a Python project called Ibis [3]
(similar to R's dplyr) that serves as a "Python analytical query IR
builder", capable of generating most of the SQL standard and targeting
many different SQL dialects and other backends (like pandas).
Microsoft [4] and Google [5] have used this library as a "many-SQL"
middleware. As such, I would like to be able to translate the
in-memory "logical query" data structures in a library like Ibis into
a serialized format that can be executed by many different
Arrow-native query engines. The expression primitives available in
Ibis should serve as helpful test cases, too.

I look forward to the community's comments on the RFC document [1] and
pull request [2]. I realize that arriving at consensus on a complex
and ambitious project like this can be challenging, so I recommend
spending time on the "non-goals" section in the RFC and asking
questions if you are unclear about the scope of the problems this is
trying to solve. I would be happy to give Edit access on the RFC
document to others, and would welcome ideas about how to move forward
with something that can be implemented by different Arrow libraries in
the reasonably near future.

Thanks,
Wes

[1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
[2]: https://github.com/apache/arrow/pull/10856
[3]: https://ibis-project.org/
[4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
[5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
