Thanks Wes,

Great to be back working on Arrow again and engaging with the
community. I am really excited about this effort.
There are a number of concerns I see as important to address in the
compute IR proposal:

1. Requirement for output types

So far there have been many reasons given for requiring conforming IR
producers and consumers to adhere to output types, but I haven't seen
a strong rationale for keeping them optional (in the semantic sense,
not with respect to any particular serialization format's
representation of optionality). A design that includes unambiguous
semantics for output types (a consumer must produce a value of the
requested type or it's an error/non-conforming) is simpler for
producers to reason about, and provides a strong guarantee for end
users (humans or machines constructing IR from an API and expecting
the thing they ask for back from the IR consumer).

2. Flexibility

The current PR is unable to support what I think are the killer
features of the IR: custom operators (relational or column) and UDFs.
In my mind, on top of the generalized compute description that the IR
offers, the ability of producers and consumers to extend the IR
without needing to modify Arrow or depend on anything except the
format is itself necessary to gain adoption. Developers will need to
build custom relational operators (e.g., scans of backends that no
engine supports out of the box, but for which a user has an
implementation) and custom functions (really, any operation on a
column that doesn't already exist). Furthermore, I think we can drive
building an Arrow consumer using the same API that an end user would
use to extend the IR. (A rough sketch of what an extension point
might look like is at the end of this message.)

3. Window Functions

Window functions are, I think, an important part of the IR value
proposition, as they are among the more complex operators in
databases. We need something in the initial IR proposal to support
these operations.

4. Non-relational Joins

Operators like as-of join and window join aren't yet fleshed out in
the IR, and I'm not sure they should be in scope for the initial
prototype. Once we settle on a design, we can work out the design of
these particular operators during the initial prototype; their
specification should basically be PR #2 after the initial design
lands.

# Order of Work

1. Nail down the design. Anything else is a non-starter.

2. Prototype an IR producer using Ibis.

Ibis is IMO a good candidate for a first IR producer, as it has a
number of desirable properties that make prototyping faster and allow
us to refine the design of the IR as needed based on how the
implementation goes (see the sketch at the end of this message):

* It's written in Python, so it has native support for nearly all of
  flatbuffers' features without having to create bindings to C++.
* There's already a set of rules for type checking, as well as APIs
  for constructing expression trees, which means we don't need to
  worry about building a type checker for the prototype.

3. Prototype an IR consumer in C++.

In parallel with the producer prototype, we can further inform the
design from the consumer side by prototyping an IR consumer in C++. I
know Ben Kietzman has expressed interest in working on this.
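To make point 2 concrete, here's a rough sketch of the shape an
extension node might take. Everything here is hypothetical (none of
these names appear in the current PR); the idea is just that a custom
operator is a namespaced name plus an opaque payload, carrying an
explicit output type so that consumers that don't implement the
operator can still type-check the plan around it:

    from dataclasses import dataclass

    import pyarrow as pa

    @dataclass(frozen=True)
    class UserDefinedOperator:
        # Namespaced so independent producers/consumers don't collide.
        name: str
        # Opaque to Arrow; interpreted only by the producer and by
        # consumers that implement this operator.
        payload: bytes
        # Required output type (per point 1): even a consumer that
        # cannot execute the operator can type-check the enclosing plan.
        output_type: pa.DataType

    op = UserDefinedOperator(
        name="org.example.sessionize",  # hypothetical operator name
        payload=b"\x00",                # stand-in for producer-defined options
        output_type=pa.timestamp("us"),
    )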
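And on points 1 and 3 and the producer prototype: Ibis already builds
typed expression trees of the kind we'd lower to IR, which is where
the "output types for free" argument comes from. A small illustration
(the table and columns are invented for the example):

    import ibis

    # An unbound table; the schema is made up for illustration.
    t = ibis.table(
        [("key", "string"), ("ts", "timestamp"), ("value", "double")],
        name="events",
    )

    # Ibis type-checks expressions as they're built, so every node
    # already knows its output type (point 1).
    total = t.value.sum()
    total.type()  # a float64/double type

    # A window function (point 3): trailing mean over the last 5 rows
    # per key, ordered by timestamp.
    w = ibis.window(group_by=t.key, order_by=t.ts, preceding=5, following=0)
    expr = t.mutate(rolling_avg=t.value.mean().over(w))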
Very interested to hear others' thoughts.

-Phillip

On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com> wrote:

> Thank you for all the feedback and comments on the document. I'm on
> vacation this week, so I'm delayed responding to everything, but I
> will get to it as quickly as I can. I will be at VLDB in Copenhagen
> next week if anyone would like to chat in person about it, and we can
> relay the content of any discussions back to the document/PR/e-mail
> thread.
>
> I know that Phillip Cloud expressed interest in working on the PR and
> helping work through many of the details, so I'm glad to have the
> help. If there are others who would like to work on the PR or dig
> into the details, please let me know. We might need to figure out how
> to accommodate "many cooks" by setting up the ComputeIR project
> somewhere separate from the format/ directory to permit it to exist
> in a Work-In-Progress status for a period of time until we work
> through the various details and design concerns.
>
> Re Julian's comment
>
> > The biggest surprise is that this language does full relational
> > operations. I was expecting that it would do fragments of the
> > operations.
>
> There's a related but different (yet still interesting and worthy of
> analysis) problem of creating an "engine language" that describes
> more mechanically the constituent parts of implementing the
> relational operators. To create a functional computation language
> with concrete Arrow data structures as a top-level primitive sounds
> like an interesting research area where I could see something
> developing eventually.
>
> The main problem I'm interested in solving right now is enabling
> front ends that have sufficient understanding of relational algebra
> and data frame operations to talk to engines without having to go
> backwards from their logical query plans to SQL. So as mentioned in
> the document, being able to faithfully carry the relational operator
> node information generated by Calcite or Ibis or another system
> would be super useful. Defining the semantics of various kinds of
> user-defined functions would also be helpful to standardize the
> engine-to-user-language UDF/extension interface.
>
> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <alen...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > cool initiative! Reminded me of "Building Advanced SQL Analytics
> > From Low-Level Plan Operators" from SIGMOD 2021
> > (http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf), which
> > proposes a set of building blocks for advanced aggregation.
> >
> > Cheers,
> > Dimitri.
> >
> > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> > >
> > > Wes,
> > >
> > > Thanks for this. I've added comments to the doc and to the PR.
> > >
> > > The biggest surprise is that this language does full relational
> > > operations. I was expecting that it would do fragments of the
> > > operations. Consider join. A distributed hybrid hash join needs
> > > to partition rows into output buffers based on a hash key, build
> > > hash tables, probe into hash tables, scan hash tables for
> > > untouched "outer" rows, and so forth.
> > >
> > > I see Arrow's compute as delivering each of those operations,
> > > working on perhaps a batch at a time, or a sequence of batches,
> > > with some other system coordinating those tasks. So I would
> > > expect to see Arrow's compute language mainly operating on
> > > batches rather than a table abstraction.
> > > Julian
> > >
> > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > hi folks,
> > > >
> > > > This idea came up in passing in the past -- given that there
> > > > are multiple independent efforts to develop Arrow-native query
> > > > engines (and surely many more to come), it seems like it would
> > > > be valuable to have a way to enable user languages (like Java,
> > > > Python, R, or Rust, for example) to communicate with backend
> > > > computing engines (like DataFusion, or new computing
> > > > capabilities being built in the Arrow C++ library) in a fashion
> > > > that is "lower-level" than SQL and specialized to Arrow's type
> > > > system. So rather than leaving it to a SQL parser / analyzer
> > > > framework to generate an expression tree of relational
> > > > operators and then translate that to an Arrow-native compute
> > > > engine's internal grammar, a user framework could provide the
> > > > desired Arrow-native expression tree / data manipulations
> > > > directly and skip the SQL altogether.
> > > >
> > > > The idea of creating a "serialized intermediate representation
> > > > (IR)" for Arrow compute operations would be to serve use cases
> > > > large and small -- from the most complex TPC-* or time series
> > > > database query to the most simple array predicate/filter sent
> > > > with an RPC request using Arrow Flight. It is deliberately
> > > > language- and engine-agnostic, with the only commonality across
> > > > usages being the Arrow columnar format (schemas and array
> > > > types). This would be better than leaving it to each
> > > > application to develop its own bespoke expression
> > > > representations for its needs.
> > > >
> > > > I spent a while thinking about this and wrote up a brain dump
> > > > RFC document [1] and accompanying pull request [2] that makes
> > > > the possibly controversial choice of using Flatbuffers to
> > > > represent the serialized IR. I discuss the rationale for the
> > > > choice of Flatbuffers in the RFC document. This PR is obviously
> > > > deficient in many regards (incomplete, hacky, or unclear in
> > > > places), and will need some help from others to flesh out. I
> > > > suspect that actually implementing the IR will be necessary to
> > > > work out many of the low-level details.
> > > >
> > > > Note that this IR is intended to be more of a "superset"
> > > > project than a "lowest common denominator". So there may be
> > > > things that it includes which are only available in some
> > > > engines (e.g. engines that have specialized handling of time
> > > > series data).
> > > >
> > > > As some of my personal motivation for the project, concurrent
> > > > with the genesis of Apache Arrow, I started a Python project
> > > > called Ibis [3] (which is similar to R's dplyr project) which
> > > > serves as a "Python analytical query IR builder" that is
> > > > capable of generating most of the SQL standard, targeting many
> > > > different SQL dialects and other backends (like pandas).
> > > > Microsoft ([4]) and Google ([5]) have used this library as a
> > > > "many-SQL" middleware. As such, I would like to be able to
> > > > translate between the in-memory "logical query" data structures
> > > > in a library like Ibis to a serialized format that can be
> > > > executed by many different Arrow-native query engines. The
> > > > expression primitives available in Ibis should serve as helpful
> > > > test cases, too.
> > > > I look forward to the community's comments on the RFC document
> > > > [1] and pull request [2] -- I realize that arriving at
> > > > consensus on a complex and ambitious project like this can be
> > > > challenging, so I recommend spending time on the "non-goals"
> > > > section in the RFC and asking questions if you are unclear
> > > > about the scope of what problems this is trying to solve. I
> > > > would be happy to give Edit access on the RFC document to
> > > > others and would consider ideas about how to move forward with
> > > > something that is able to be implemented by different Arrow
> > > > libraries in the reasonably near future.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > [3]: https://ibis-project.org/
> > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt