First, I think you are correct that there is a lot of value to users here.
I'd love for a capability like this to someday be in pyarrow too for Arrow
compute functions.

I think there is a distinct enough difference between "a query language"
and "a programming language".  However, both of them are supersets of "an
expression language".  Is there not already an existing syntax that can be
used?  I would recommend at least looking at Postgres' language for
expression (value expressions)[1].  I don't think everything applies but I
do think you might want to eventually add a few more of the things found
there such as:

 * nested / list array subscripts (e.g. x.y.z == 50 || my_list_array[0] ==
false)
 * positional / named parameters (e.g. my_foo == $1)
 * parentheses to allow explicit precedence (e.g. (x + 5) * 20)

As an advocate of Substrait I will at least suggest looking into it and
possibly proposing a text format for Substrait expressions.

Some specific feedback follows:

> For functions with
>    multiple overloads, their return type is inferred from input types. For
>    commonly used functions, we would also like to support infix forms.

Is there a reason you don't have logical and (&&), logical or (||) and
logical negation (!)?  Actually, I see you call these out separately later
on (except you are maybe missing negation).  Is there any reason for the
distinction?

>    Ifs: We would like to support two grammars for if expressions:
>    - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>       - if(<cond>) { <then> } else { <else> } since it’s the C++ if
grammar
>       and has better formatting for complex expressions.

Given the fact we are creating functional expressions and not procedural
code I would find the `if() {} else {}` syntax rather confusing.  Do you
have an example of one of these complex expressions?

>   InExpressions: <eval> in (<member1>, <member2>, ...) . Its type is also
>   inferred.

What type?  The type of `<eval>`?  Why infer it?  I agree it can be
inferred but it seems like adding inference support (instead of just
assuming default types and rejecting invalid queries) is adding more
complexity for both parser and user (remember the rules of when I have to
type literals).  Are there other places where inference is needed?

> The grammar can be roughly represented as:

I'm not great at reading grammars but does your proposal support string
literals?

[1] https://www.postgresql.org/docs/current/sql-expressions.html

On Sun, Sep 18, 2022 at 9:12 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> I would add that Gandiva does not seem to have a lot of active
> maintainers nowadays (as opposed to people who merely add functions for
> their own use cases).  If you would like to take such a responsability
> it would probably be very welcome.
>
> Regards
>
> Antoine.
>
>
>
> Le 16/09/2022 à 17:57, Jin Shang a écrit :
> > Background
> >
> > My team uses an expression computation library for our C++ feature
> > engineering pipeline. We currently use Exprtk
> > <https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva
> and
> > wrote some benchmarks. We discovered that Gandiva is several times faster
> > than Exprtk in our use cases. Therefore we decided to switch to Gandiva
> for
> > computing expressions.
> > Objective
> >
> > As of current, due to its lack of a frontend, we need to manually
> construct
> > an AST to use Gandiva. This is inconvenient and requires extra learning
> > costs. We also want to enable our ML engineers to dynamically
> create/update
> > an expression with runtime hot-loaded configs without restarting our
> > server. This is currently impossible with Gandiva because the Expression
> > tree is statically created with C++ and must be compiled in the server
> > binary.
> >
> > Therefore, we would like to implement a parser frontend for Gandiva, so
> > that Gandiva becomes a standalone complete expression compiler and
> > evaluator, and a drop-in replacement for the existing libraries like
> Exprtk
> > <https://github.com/ArashPartow/exprtk> and TinyExpr
> > <https://github.com/codeplea/tinyexpr>. The goal is to enable the
> following
> > functionality:
> >
> > // Create schema for gandiva
> > auto field_x = arrow::field("x", arrow::uint64());
> > auto field_y = arrow::field("y", arrow::float64());
> > auto field_z = arrow::field("z", arrow::boolean());
> > auto schema = arrow::schema({field_x, field_y, field_z});
> >
> > /** BEGIN CHANGE **/
> > // Use the Parser to generate a NodePtr
> > std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
> > auto parser = gandiva::Parser(schema);
> > gandiva::NodePtr root_node;
> > auto status = parser.Parse(expr_str, &root_node);
> > /** END CHANGE **/
> >
> > // The rest is normal usage of Gandiva projector
> > auto expr_ptr = std::make_shared<gandiva::Expression>(root_node,
> result_field);
> > std::shared_ptr<gandiva::Projector> projector;
> > auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
> >
> > auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
> > auto* pool = arrow::default_memory_pool();
> > arrow::ArrayVector outputs;
> > auto status = projector->Evaluate(*in_batch, pool, &outputs);
> >
> > The code block enclosed by “BEGIN CHANGE” and “END CHANGE” is the
> proposed
> > usage of the parser. It offers two benefits:
> >
> >     1. It’s more intuitive to write math expressions compared to
> >     constructing trees, thus easier to use.
> >     2. It allows dynamically adding new expressions or and changing
> existing
> >     ones with a runtime hot-loaded config file without restarting our
> server.
> >
> > Syntax
> >
> > The goal is to design a succinct and intuitive grammar for both schema
> and
> > Gandiva expressions. We will need a corresponding grammar for each Node
> > type in Gandiva.
> >
> >     -
> >
> >     Literals: We find Rust’s literal representation(
> >     <https://doc.rust-lang.org/rust-by-example/types/literals.html>
> >     https://doc.rust-lang.org/rust-by-example/types/literals.html) very
> >     intuitive. We’ll support suffixes such as i32, u64, f32 to denote a
> >     literal node’s type. The types of unsuffixed literals are inferred
> by their
> >     usage. Otherwise, the default type for integers is int32 and float32
> for
> >     floating points. String and binary literals are wrapped with single
> or
> >     double quotes. Decimal128 literals will not be supported in the first
> >     version.
> >     -
> >
> >     Fields: Just their names as defined in the schema. To avoid conflicts
> >     with other node types, field names must start with alphabetical
> letters.
> >     -
> >
> >     Functions: <function_name>(<param1>, <param2>, ...). For functions
> with
> >     multiple overloads, their return type is inferred from input types.
> For
> >     commonly used functions, we would also like to support infix forms.
> They
> >     include:
> >     - Comparisons: equal(==), not equal(!=), greater than(>), greater
> than
> >        or equal to(>=), less than(<), less than or equal to(<=)
> >        - Arithmetics: add(+), subtract(-), multiply(*), divide(/),
> >        modulo(%), power(^), bitwise and(&), bitwise or(|), bitwise
> > xor(^), bitwise
> >        not(~)
> >
> >     Function aliases with spaces in their names won’t be supported such
> as
> >     “is not false” are not supported.
> >     -
> >
> >     Ifs: We would like to support two grammars for if expressions:
> >     - if(<cond>, <then>, <else>) for its simplicity and functional feel;
> >        - if(<cond>) { <then> } else { <else> } since it’s the C++ if
> grammar
> >        and has better formatting for complex expressions.
> >     -
> >
> >     Booleans: We would like to support both && || and and or keywords the
> >     same as C++.
> >     -
> >
> >     InExpressions: <eval> in (<member1>, <member2>, ...) . Its type is
> also
> >     inferred.
> >
> > The grammar can be roughly represented as:
> >
> > exp := term
> > term := literal | field | function | if | boolean | inexpression
> > literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
> > FLOAT_LITERAL_SUFFIX | "-" literal
> > field := IDENTIFIER
> > function := infix_function | named_function
> > infix_function := "-" term | term "+" term | term "-" term ...
> > named_function := IDENTIFIER(arguments)
> > arguments := term | argument "," term
> > if := "if" "(" term "," term "," term ")" | "if" "(" term ")" "{"
> > term"}" ELSE "{" term "}"
> > boolean := and_boolean | or_boolean
> > and_boolean := term AND term | term "&&" term
> > or_boolean := term OR term | term "||" term
> > inexpression := term IN "(" arguments ")"
> >
> > lower cases are non-terminals and upper cases are tokens.
> > Implementation Lexing and Parsing
> >
> > We would like to use flex and bison for lexing and parsing. They have
> > several advantages compared to other options like Antlr4 and
> Boost::Spirit.
> >
> >     1. They are the most classical and popular parsing library in the cpp
> >     world.
> >     2. They allow us to precompile the parser codes and have no runtime
> >     dependencies.
> >
> > Flex&bison takes a string and outputs a gandiva::node tree. The tree may
> > contain untyped nodes, e.g., unsuffixed literals and functions.
> > Type Inference
> >
> > We’ll have a TypeInferenceVisitor class that inherits node visitor,
> > implementing a 2-pass DFS algorithm to do type inference. In each pass,
> the
> > visitor tries to infer current node’s and its children’s types from
> > currently available info:
> >
> > If it’s a non-leaf node such as function ,if ,boolean:
> >
> >     1. First visit each child: let them infer their own types as much as
> >     they can.
> >     2. Create a SignaturePattern based on currently known types. The
> pattern
> >     includes the current node’s type and children’s types. The types can
> be
> >     nullptr meaning the type is currently unknown. For example, the
> >     SignaturePattern for func(x: u64, 5: untyped) will be (u64,
> >     nullptr)—>nullptr.
> >     3. Get all available signatures of the current node. For functions,
> it’s
> >     the signatures registered at the function registry. For if, it’s
> (bool,
> >     <type>, <type>)—><type> for any type etc.
> >     4. Try to match each signature with the current pattern
> SignaturePattern
> >        1. If no one matches, it’s a type error.
> >        2. If only one matches, it’s the matched signature. We then update
> >        current node’s and their children’s types based on this signature.
> >        3. If more than one matches, we extract a common pattern from the
> set
> >        of matched signatures. For example, (nullptr, nullptr)—>bool can
> be
> >        extracted from (double, double)—>bool and (int32, int32)—>bool .
> We
> >        then update types based on this common pattern.
> >
> > If it’s a leaf node, just update its value if its type is set by its
> > parent. We need to do this because untyped literals’ values are saved in
> a
> > generic container.
> >
> > We run this procedure 2 times, with the second pass a bit different. In
> the
> > second pass, if a literal node is still untyped, give it a default type (
> > int32 for ints and float32 for floats).
> >
> >     1. The first pass is bottom up propagation of types. A parent’s type
> is
> >     inferred from its children.
> >     2. The second pass is both:
> >        1. top down propagation of first pass’ info. Once a parent’s type
> is
> >        known in the first pass, it may set it’s children’s type
> >        2. bottom up of default literal types. A literal is given a
> default
> >        type.
> >
> > In the second pass, since all leaf nodes’ types are known. The types of
> all
> > nodes are guaranteed to be inferred.
> > Proof of correctness
> >
> > I have a proof of correctness of this inference procedure in mind, but
> it’s
> > too long to be written down here. (Fermat did it too 🙂) But the
> > correctness is based on these two facts:
> >
> >     1. There are no overloads that differ only on their return types.
> E.g.
> >     there are no func := (int32, int32)-->int32 and func := (int32,
> >     int32)-->double in the Gandiva registry. Therefore we can always
> know a
> >     function’s type if its children’s types are known.
> >     2. There are no overloads that accept only non-default numeric types.
> >     For example, if there is a func := (int16, int16)-->int16 and func :=
> >     (int64, int64)-->int64, but no func := (int32, int32)-->int32, then
> the
> >     inference procedure will fail on expression func(1, 2) because it
> cannot
> >     tell the types of 1 and 2, and giving them default types won't work.
> >
> > Prototype and Examples
> >
> > I’ve already written a mostly working prototype of the parser and type
> > inference visitor and unit tests for them. They are on the branch
> > <https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference>
> > https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference.
> >
> > You can checkout the diffs here:
> > <
> https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
> >
> >
> https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
> >
> > The main files are:
> >
> >     1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
> >     2. cpp/src/gandiva/lex.ll: lex rules for Flex.
> >     3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
> >     4. cpp/src/gandiva/parser.cc: the driver class that combines the
> three
> >     components.
> >     5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of
> the
> >     proposed syntax and the result expression trees the parser
> generates. You
> >     can run the tests by running cmake .. --preset=ninja-debug-gandiva
> and ninja
> >     test-gandiva-tests.
> >
> > Any suggestion/question is appreciated!
> >
> >
> > Best regards,
> >
> > Jin Shang
> >
>

Reply via email to