Background

My team uses an expression computation library for our C++ feature
engineering pipeline. We currently use Exprtk
<https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva and
wrote some benchmarks. We discovered that Gandiva is several times faster
than Exprtk in our use cases. Therefore we decided to switch to Gandiva for
computing expressions.

Objective

Currently, because Gandiva lacks a parser frontend, we must construct an
AST manually to use it. This is inconvenient and adds a learning cost. We
also want our ML engineers to be able to create or update expressions
dynamically via runtime hot-loaded configs, without restarting our server.
This is currently impossible with Gandiva because the expression tree is
constructed statically in C++ and must be compiled into the server binary.

Therefore, we would like to implement a parser frontend for Gandiva, so
that Gandiva becomes a standalone complete expression compiler and
evaluator, and a drop-in replacement for the existing libraries like Exprtk
<https://github.com/ArashPartow/exprtk> and TinyExpr
<https://github.com/codeplea/tinyexpr>. The goal is to enable the following
functionality:

// Create schema for gandiva
auto field_x = arrow::field("x", arrow::uint64());
auto field_y = arrow::field("y", arrow::float64());
auto field_z = arrow::field("z", arrow::boolean());
auto schema = arrow::schema({field_x, field_y, field_z});

/** BEGIN CHANGE **/
// Use the Parser to generate a NodePtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto parse_status = parser.Parse(expr_str, &root_node);
/** END CHANGE **/

// The rest is normal usage of the Gandiva projector.
// result_field is the output field, e.g. arrow::field("result", arrow::float64()).
auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
std::shared_ptr<gandiva::Projector> projector;
auto make_status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);

// input_arr holds the input arrays; size is the number of rows.
auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
auto eval_status = projector->Evaluate(*in_batch, pool, &outputs);

The code block enclosed by “BEGIN CHANGE” and “END CHANGE” is the proposed
usage of the parser. It offers two benefits:

   1. It’s more intuitive to write math expressions compared to
   constructing trees, thus easier to use.
   2. It allows dynamically adding new expressions or changing existing
   ones via a runtime hot-loaded config file, without restarting our server.
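
To make the second benefit concrete, the hot-reload workflow in server code
could look roughly like the following. This is only a sketch against the
Parser API proposed in this document (which does not exist yet), and
LoadConfigString() is a hypothetical helper standing in for whatever config
mechanism the server uses:

```cpp
// Sketch only: gandiva::Parser is the API proposed in this document;
// LoadConfigString() is a hypothetical hot-loaded-config helper.
arrow::Status ReloadProjector(const std::shared_ptr<arrow::Schema>& schema,
                              const std::shared_ptr<arrow::Field>& result_field,
                              std::shared_ptr<gandiva::Projector>* projector) {
  std::string expr_str = LoadConfigString();  // e.g. "y * 1000.0"
  gandiva::NodePtr root_node;
  auto parser = gandiva::Parser(schema);
  // On a parse error, return early and keep serving with the old projector.
  ARROW_RETURN_NOT_OK(parser.Parse(expr_str, &root_node));
  auto expr = std::make_shared<gandiva::Expression>(root_node, result_field);
  return gandiva::Projector::Make(schema, {expr}, projector);
}
```

The old projector stays in place if parsing fails, so a bad config entry
degrades gracefully instead of crashing the server.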

Syntax

The goal is to design a succinct and intuitive grammar for both schema and
Gandiva expressions. We will need a corresponding grammar for each Node
type in Gandiva.

   - Literals: We find Rust’s literal representation
   (https://doc.rust-lang.org/rust-by-example/types/literals.html) very
   intuitive. We’ll support suffixes such as i32, u64 and f32 to denote a
   literal node’s type. The types of unsuffixed literals are inferred from
   their usage; otherwise, the default type is int32 for integers and
   float32 for floating points. String and binary literals are wrapped in
   single or double quotes. Decimal128 literals will not be supported in
   the first version.

   - Fields: Just their names as defined in the schema. To avoid conflicts
   with other node types, field names must start with an alphabetical
   letter.

   - Functions: <function_name>(<param1>, <param2>, ...). For functions
   with multiple overloads, the return type is inferred from the input
   types. For commonly used functions, we would also like to support infix
   forms. They include:
      - Comparisons: equal (==), not equal (!=), greater than (>), greater
      than or equal to (>=), less than (<), less than or equal to (<=)
      - Arithmetic: add (+), subtract (-), multiply (*), divide (/),
      modulo (%), power (^), bitwise and (&), bitwise or (|), bitwise
      xor (^), bitwise not (~)

   Function aliases with spaces in their names, such as “is not false”,
   won’t be supported.

   - Ifs: We would like to support two grammars for if expressions:
      - if(<cond>, <then>, <else>) for its simplicity and functional feel;
      - if(<cond>) { <then> } else { <else> }, since it’s the C++ if
      grammar and formats better for complex expressions.

   - Booleans: We would like to support both && / || and the and / or
   keywords, the same as C++.

   - InExpressions: <eval> in (<member1>, <member2>, ...). Its type is
   also inferred.
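
Putting the pieces together, a few expressions that would be valid under
this syntax (illustrative examples only; castFloat8 is the function name
used in the Objective example):

```
x + 5u64                                      // infix add, suffixed integer literal
x % 5 == 0 && z                               // infix functions with boolean operators
if (z) { castFloat8(x) } else { y * 1000.0 }  // C++-style if grammar
castFloat8(x) in (1.0, 2.5)                   // InExpression, member types inferred
```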

The grammar can be roughly represented as:

exp := term
term := literal | field | function | if | boolean | inexpression
literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
           FLOAT_LITERAL_SUFFIX | "-" literal
field := IDENTIFIER
function := infix_function | named_function
infix_function := "-" term | term "+" term | term "-" term ...
named_function := IDENTIFIER "(" arguments ")"
arguments := term | arguments "," term
if := "if" "(" term "," term "," term ")" |
      "if" "(" term ")" "{" term "}" ELSE "{" term "}"
boolean := and_boolean | or_boolean
and_boolean := term AND term | term "&&" term
or_boolean := term OR term | term "||" term
inexpression := term IN "(" arguments ")"

Lowercase names are non-terminals and uppercase names are tokens.
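
As a concrete illustration, the expression from the Objective section maps
onto this grammar roughly as follows (a hand-drawn sketch, not actual
parser output):

```
if(z, castFloat8(x), y * 1000.0)
└─ if                                  (float64)
   ├─ cond: field z                    (boolean)
   ├─ then: castFloat8(field x)        (float64)
   └─ else: multiply(field y, 1000.0)  (float64; the unsuffixed literal
                                        is inferred as float64 from y)
```
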

Implementation

Lexing and Parsing

We would like to use Flex and Bison for lexing and parsing. They have
several advantages over other options like ANTLR4 and Boost.Spirit:

   1. They are the most classical and popular parsing tools in the C++
   world.
   2. They allow us to precompile the parser code and have no runtime
   dependencies.

Flex and Bison take a string and output a gandiva::Node tree. The tree may
contain untyped nodes, e.g., unsuffixed literals and functions.

Type Inference

We’ll have a TypeInferenceVisitor class that inherits from the node
visitor and implements a 2-pass DFS algorithm for type inference. In each
pass, the visitor tries to infer the current node’s and its children’s
types from the currently available info.

If it’s a non-leaf node, such as a function, if, or boolean:

   1. First visit each child: let them infer their own types as much as
   they can.
   2. Create a SignaturePattern based on the currently known types. The
   pattern includes the current node’s type and its children’s types. Any
   type can be nullptr, meaning it is currently unknown. For example, the
   SignaturePattern for func(x: u64, 5: untyped) is (u64, nullptr) -> nullptr.
   3. Get all available signatures of the current node. For functions,
   these are the signatures registered in the function registry; for if,
   it’s (bool, <type>, <type>) -> <type> for any type, etc.
   4. Try to match each signature against the current SignaturePattern:
      1. If none matches, it’s a type error.
      2. If exactly one matches, we update the current node’s and its
      children’s types based on that signature.
      3. If more than one matches, we extract a common pattern from the
      set of matched signatures. For example, (nullptr, nullptr) -> bool
      can be extracted from (double, double) -> bool and
      (int32, int32) -> bool. We then update types based on this common
      pattern.

If it’s a leaf node, we just update its value if its type was set by its
parent. We need to do this because an untyped literal’s value is stored in
a generic container.
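
The matching and common-pattern steps above can be sketched in isolation.
This is a minimal, self-contained illustration rather than Gandiva code:
types are plain strings, and std::nullopt plays the role of the nullptr
"unknown" marker:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// A type is just a name here; std::nullopt means "not yet known".
using Type = std::string;
using MaybeType = std::optional<Type>;

// A signature: parameter types plus a return type. Registered signatures
// are fully concrete; a pattern built from a node may contain unknowns.
struct Signature {
  std::vector<MaybeType> params;
  MaybeType ret;
};

// A concrete candidate matches a pattern if every known slot agrees;
// unknown slots in the pattern match anything.
bool Matches(const Signature& pattern, const Signature& candidate) {
  if (pattern.params.size() != candidate.params.size()) return false;
  for (size_t i = 0; i < pattern.params.size(); ++i) {
    if (pattern.params[i] && pattern.params[i] != candidate.params[i]) {
      return false;
    }
  }
  return !pattern.ret || pattern.ret == candidate.ret;
}

// Extract the common pattern of all matched signatures: a slot keeps its
// type only if every candidate agrees on it, otherwise it stays unknown.
// Precondition: matched is non-empty (the "none matches" case is an error).
Signature CommonPattern(const std::vector<Signature>& matched) {
  Signature common = matched.front();
  for (const auto& sig : matched) {
    for (size_t i = 0; i < common.params.size(); ++i) {
      if (common.params[i] != sig.params[i]) common.params[i] = std::nullopt;
    }
    if (common.ret != sig.ret) common.ret = std::nullopt;
  }
  return common;
}
```

For instance, the pattern (u64, nullptr) -> nullptr matches both
(u64, u64) -> u64 and (u64, int32) -> bool, and the extracted common
pattern keeps only the first argument's type, so nothing new is learned
about the second argument or the return type until a later pass.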

We run this procedure 2 times, with the second pass a bit different. In the
second pass, if a literal node is still untyped, give it a default type (
int32 for ints and float32 for floats).

   1. The first pass is a bottom-up propagation of types: a parent’s type
   is inferred from its children’s.
   2. The second pass is both:
      1. a top-down propagation of the first pass’ info: once a parent’s
      type is known from the first pass, it may set its children’s types;
      2. a bottom-up propagation of default literal types: a literal is
      given a default type.

Since all leaf nodes’ types are known in the second pass, the types of all
nodes are guaranteed to be inferred.

Proof of correctness

I have a proof of correctness of this inference procedure in mind, but
it’s too long to be written down here. (Fermat did it too 🙂) The
correctness rests on these two facts:

   1. There are no overloads that differ only in their return types. E.g.,
   there are no func := (int32, int32) -> int32 and func := (int32,
   int32) -> double in the Gandiva registry. Therefore we can always
   determine a function’s type if its children’s types are known.
   2. There are no overloads that accept only non-default numeric types.
   For example, if there were a func := (int16, int16) -> int16 and a
   func := (int64, int64) -> int64, but no func := (int32, int32) -> int32,
   then the inference procedure would fail on the expression func(1, 2)
   because it cannot tell the types of 1 and 2, and giving them default
   types won’t work.

Prototype and Examples

I’ve already written a mostly working prototype of the parser and type
inference visitor, along with unit tests for them. It’s on this branch:
https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference

You can check out the diff here:
https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference

The main files are:

   1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
   2. cpp/src/gandiva/lex.ll: lex rules for Flex.
   3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
   4. cpp/src/gandiva/parser.cc: the driver class that combines the three
   components.
   5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the
   proposed syntax and the resulting expression trees the parser generates.
   You can run the tests with cmake .. --preset=ninja-debug-gandiva and
   ninja test-gandiva-tests.

Any suggestion/question is appreciated!


Best regards,

Jin Shang
