Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Jorge Cardoso Leitão Wed, 11 Aug 2021 13:48:33 -0700

Couple of questions

1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
2. if yes, imo we may need to worry about:
* a definition of equality that implementations agree on.
* agreement over what the semantics look like. For example, do we use
kleene logic for AND and OR?


To try to understand the gist, let's pick an aggregated count over a
column: engines often do partial counts over partitions followed by a final
"sum" over the partial counts. Is the idea that the query engine would
communicate with the compute engine via 2 IRs where one is "count me these"
the other is "sum me these"?

Best,
Jorge





On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <[email protected]> wrote:

> Thanks Wes,
>
> Great to be back working on Arrow again and engaging with the community. I
> am really excited about this effort.
>
> I think there are a number of concerns I see as important to address in the
> compute IR proposal:
>
> 1. Requirement for output types.
>
> I think that so far there's been many reasons for requiring conforming IR
> producers and consumers to adhere to output types, but I haven't seen a
> strong rationale for keeping them optional (in the semantic sense, not WRT
> any particular serialization format's representation of optionality).
>
> I think a design that includes unambiguous semantics for output types (a
> consumer must produce a value of the requested type or it's an
> error/non-conforming) is simpler to reason about for producers, and
> provides a strong guarantee for end users (humans or machines constructing
> IR from an API and expecting the thing they ask for back from the IR
> consumer).
>
> 2. Flexibility
>
> The current PR is currently unable to support what I think are killer
> features of the IR: custom operators (relational or column) and UDFs. In my
> mind, on top of the generalized compute description that the IR offers, the
> ability for producers and consumers of IR to extend the IR without needing
> to modify Arrow or depend on anything except the format is itself something
> that is necessary to gain adoption.
>
> Developers will need to build custom relational operators (e.g., scans of
> backends that don't exist anywhere for which a user has code to implement)
> and custom functions (anything operating on a column that doesn't already
> exist, really). Furthermore, I think we can actually drive building an
> Arrow consumer using the same API that an end user would use to extend the
> IR.
>
> 3. Window Functions
>
> Window functions are, I think, an important part of the IR value
> proposition, as they are one of the more complex operators in databases. I
> think we need to have something in the initial IR proposal to support these
> operations.
>
> 4. Non relational Joins
>
> Things like as-of join and window join operators aren't yet fleshed out in
> the IR, and I'm not sure they should be in scope for the initial prototype.
> I think once we settle on a design, we can work the design of these
> particular operators out during the initial prototype. I think the
> specification of these operators should basically be PR #2 after the
> initial design lands.
>
> # Order of Work
>
> 1. Nail down the design. Anything else is a non-starter.
>
> 2. Prototype an IR producer using Ibis
>
> Ibis is IMO a good candidate for a first IR producer as it has a number of
> desirable properties that make prototyping faster and allow for us to
> refine the design of the IR as needed based on how the implementation goes:
> * It's written in Python so it has native support for nearly all of
> flatbuffers' features without having to creating bindings to C++.
> * There's already a set of rules for type checking, as well as APIs for
> constructing expression trees, which means we don't need to worry about
> building a type checker for the prototype.
>
> 3. Prototype an IR consumer in C++
>
> I think in parallel to the producer prototype we can further inform the
> design from the consumer side by prototyping an IR consumer in C++ . I know
> Ben Kietzman has expressed interest in working on this.
>
> Very interested to hear others' thoughts.
>
> -Phillip
>
> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <[email protected]> wrote:
>
> > Thank you for all the feedback and comments on the document. I'm on
> > vacation this week, so I'm delayed responding to everything, but I
> > will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > next week if anyone would like to chat in person about it, and we can
> > relay the content of any discussions back to the document/PR/e-mail
> > thread.
> >
> > I know that Phillip Cloud expressed interest in working on the PR and
> > helping work through many of the details, so I'm glad to have the
> > help. If there are others who would like to work on the PR or dig into
> > the details, please let me know. We might need to figure out how to
> > accommodate "many cooks" by setting up the ComputeIR project somewhere
> > separate from the format/ directory to permit it to exist in a
> > Work-In-Progress status for a period of time until we work through the
> > various details and design concerns.
> >
> > Re Julian's comment
> >
> > > The biggest surprise is that this language does full relational
> > operations. I was expecting that it would do fragments of the operations.
> >
> > There's a related but different (yet still interesting and worthy of
> > analysis) problem of creating an "engine language" that describes more
> > mechanically the constituent parts of implementing the relational
> > operators. To create a functional computation language with concrete
> > Arrow data structures as a top-level primitive sounds like an
> > interesting research area where I could see something developing
> > eventually.
> >
> > The main problem I'm interested in solving right now is enabling front
> > ends that have sufficient understanding of relational algebra and data
> > frame operations to talk to engines without having to go backwards
> > from their logical query plans to SQL. So as mentioned in the
> > document, being able to faithfully carry the relational operator node
> > information generated by Calcite or Ibis or another system would be
> > super useful. Defining the semantics of various kinds of user-defined
> > functions would also be helpful to standardize the
> > engine-to-user-language UDF/extension interface.
> >
> > On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <[email protected]>
> wrote:
> > >
> > > Hi Wes,
> > >
> > > cool initiative! Reminded me of "Building Advanced SQL Analytics From
> > > Low-Level Plan Operators" from SIGMOD 2021 (
> > > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> proposes a
> > > set of building block for advanced aggregation.
> > >
> > > Cheers,
> > > Dimitri.
> > >
> > > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <[email protected]>
> > wrote:
> > >
> > > > Wes,
> > > >
> > > > Thanks for this. I’ve added comments to the doc and to the PR.
> > > >
> > > > The biggest surprise is that this language does full relational
> > > > operations. I was expecting that it would do fragments of the
> > operations.
> > > > Consider join. A distributed hybrid hash join needs to partition rows
> > into
> > > > output buffers based on a hash key, build hash tables, probe into
> hash
> > > > tables, scan hash tables for untouched “outer”rows, and so forth.
> > > >
> > > > I see Arrow’s compute as delivering each of those operations, working
> > on
> > > > perhaps a batch at a time, or a sequence of batches, with some other
> > system
> > > > coordinating those tasks. So I would expect to see Arrow’s compute
> > language
> > > > mainly operating on batches rather than a table abstraction.
> > > >
> > > > Julian
> > > >
> > > >
> > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <[email protected]>
> > wrote:
> > > > >
> > > > > hi folks,
> > > > >
> > > > > This idea came up in passing in the past -- given that there are
> > > > > multiple independent efforts to develop Arrow-native query engines
> > > > > (and surely many more to come), it seems like it would be valuable
> to
> > > > > have a way to enable user languages (like Java, Python, R, or Rust,
> > > > > for example) to communicate with backend computing engines (like
> > > > > DataFusion, or new computing capabilities being built in the Arrow
> > C++
> > > > > library) in a fashion that is "lower-level" than SQL and
> specialized
> > > > > to Arrow's type system. So rather than leaving it to a SQL parser /
> > > > > analyzer framework to generate an expression tree of relational
> > > > > operators and then translate that to an Arrow-native
> compute-engine's
> > > > > internal grammer, a user framework could provide the desired
> > > > > Arrow-native expression tree / data manipulations directly and skip
> > > > > the SQL altogether.
> > > > >
> > > > > The idea of creating a "serialized intermediate representation
> (IR)"
> > > > > for Arrow compute operations would be to serve use cases large and
> > > > > small -- from the most complex TPC-* or time series database query
> to
> > > > > the most simple array predicate/filter sent with an RPC request
> using
> > > > > Arrow Flight. It is deliberately language- and engine-agnostic,
> with
> > > > > the only commonality across usages being the Arrow columnar format
> > > > > (schemas and array types). This would be better than leaving it to
> > > > > each application to develop its own bespoke expression
> > representations
> > > > > for its needs.
> > > > >
> > > > > I spent a while thinking about this and wrote up a brain dump RFC
> > > > > document [1] and accompanying pull request [2] that makes the
> > possibly
> > > > > controversial choice of using Flatbuffers to represent the
> serialized
> > > > > IR. I discuss the rationale for the choice of Flatbuffers in the
> RFC
> > > > > document. This PR is obviously deficient in many regards
> (incomplete,
> > > > > hacky, or unclear in places), and will need some help from others
> to
> > > > > flesh out. I suspect that actually implementing the IR will be
> > > > > necessary to work out many of the low-level details.
> > > > >
> > > > > Note that this IR is intended to be more of a "superset" project
> than
> > > > > a "lowest common denominator". So there may be things that it
> > includes
> > > > > which are only available in some engines (e.g. engines that have
> > > > > specialized handling of time series data).
> > > > >
> > > > > As some of my personal motivation for the project, concurrent with
> > the
> > > > > genesis of Apache Arrow, I started a Python project called Ibis [3]
> > > > > (which is similar to R's dplyr project) which serves as a "Python
> > > > > analytical query IR builder" that is capable of generating most of
> > the
> > > > > SQL standard, targeting many different SQL dialects and other
> > backends
> > > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > library
> > > > > as a "many-SQL" middleware. As such, I would like to be able to
> > > > > translate between the in-memory "logical query" data structures in
> a
> > > > > library like Ibis to a serialized format that can be executed by
> many
> > > > > different Arrow-native query engines. The expression primitives
> > > > > available in Ibis should serve as helpful test cases, too.
> > > > >
> > > > > I look forward to the community's comments on the RFC document [1]
> > and
> > > > > pull request [2] -- I realize that arriving at consensus on a
> complex
> > > > > and ambitious project like this can be challenging so I recommend
> > > > > spending time on the "non-goals" section in the RFC and ask
> questions
> > > > > if you are unclear about the scope of what problems this is trying
> to
> > > > > solve. I would be happy to give Edit access on the RFC document to
> > > > > others and would consider ideas about how to move forward with
> > > > > something that is able to be implemented by different Arrow
> libraries
> > > > > in the reasonably near future.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > [1]:
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > > [3]: https://ibis-project.org/
> > > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > [5]:
> > > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > >
> > > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Reply via email to