Hi,

FYI:

This proposal was filed at
https://issues.apache.org/jira/browse/ARROW-17668 .

We have a similar proposal for Acero:
https://issues.apache.org/jira/browse/ARROW-17351

There is a discussion about the proposal for Acero on
ursalabs.zulipchat.com:
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Textual.20inputs.20for.20Acero

If you have an opinion on this proposal, please share it.

Thanks,
-- 
kou

In <CA+TczXx7=5z3VGsp0SSsUfKzkXk62MeCq1z8yFQ=m60txsu...@mail.gmail.com>
  "[C++][Gandiva] Proposal to Add A Parser Frontend for Gandiva" on Fri, 16 Sep 
2022 23:57:34 +0800,
  Jin Shang <shangjin1...@gmail.com> wrote:

> Background
> 
> My team uses an expression computation library for our C++ feature
> engineering pipeline. We currently use Exprtk
> <https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva and
> wrote some benchmarks. We discovered that Gandiva is several times faster
> than Exprtk in our use cases. Therefore we decided to switch to Gandiva for
> computing expressions.
> Objective
> 
> As of current, due to its lack of a frontend, we need to manually construct
> an AST to use Gandiva. This is inconvenient and requires extra learning
> costs. We also want to enable our ML engineers to dynamically create/update
> an expression with runtime hot-loaded configs without restarting our
> server. This is currently impossible with Gandiva because the Expression
> tree is statically created with C++ and must be compiled in the server
> binary.
> 
> Therefore, we would like to implement a parser frontend for Gandiva, so
> that Gandiva becomes a standalone complete expression compiler and
> evaluator, and a drop-in replacement for the existing libraries like Exprtk
> <https://github.com/ArashPartow/exprtk> and TinyExpr
> <https://github.com/codeplea/tinyexpr>. The goal is to enable the following
> functionality:
> 
> // Create schema for gandiva
> auto field_x = arrow::field("x", arrow::uint64());
> auto field_y = arrow::field("y", arrow::float64());
> auto field_z = arrow::field("z", arrow::boolean());
> auto schema = arrow::schema({field_x, field_y, field_z});
> 
> /** BEGIN CHANGE **/
> // Use the Parser to generate a NodePtr
> std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
> auto parser = gandiva::Parser(schema);
> gandiva::NodePtr root_node;
> auto status = parser.Parse(expr_str, &root_node);
> /** END CHANGE **/
> 
> // The rest is normal usage of the Gandiva projector
> auto expr_ptr =
>     std::make_shared<gandiva::Expression>(root_node, result_field);
> std::shared_ptr<gandiva::Projector> projector;
> status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
> 
> auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
> auto* pool = arrow::default_memory_pool();
> arrow::ArrayVector outputs;
> status = projector->Evaluate(*in_batch, pool, &outputs);
> 
> The code block enclosed by “BEGIN CHANGE” and “END CHANGE” is the proposed
> usage of the parser. It offers two benefits:
> 
>    1. It’s more intuitive to write math expressions compared to
>    constructing trees, thus easier to use.
>    2. It allows dynamically adding new expressions and changing existing
>    ones via a runtime hot-loaded config file without restarting our server.
> 
> Syntax
> 
> The goal is to design a succinct and intuitive grammar for both schema and
> Gandiva expressions. We will need a corresponding grammar for each Node
> type in Gandiva.
> 
>    -
> 
>    Literals: We find Rust’s literal representation
>    (https://doc.rust-lang.org/rust-by-example/types/literals.html) very
>    intuitive. We’ll support suffixes such as i32, u64, and f32 to denote a
>    literal node’s type. The types of unsuffixed literals are inferred from
>    their usage; otherwise, the default type is int32 for integers and
>    float32 for floating points. String and binary literals are wrapped in
>    single or double quotes. Decimal128 literals will not be supported in
>    the first version.
>    -
> 
>    Fields: Just their names as defined in the schema. To avoid conflicts
>    with other node types, field names must start with an alphabetic letter.
>    -
> 
>    Functions: <function_name>(<param1>, <param2>, ...). For functions with
>    multiple overloads, their return type is inferred from input types. For
>    commonly used functions, we would also like to support infix forms. They
>    include:
>    - Comparisons: equal (==), not equal (!=), greater than (>), greater
>       than or equal to (>=), less than (<), less than or equal to (<=)
>       - Arithmetic: add (+), subtract (-), multiply (*), divide (/),
>       modulo (%), power (^), bitwise and (&), bitwise or (|), bitwise
>       xor (^), bitwise not (~)
> 
>    Function aliases with spaces in their names, such as “is not false”,
>    won’t be supported.
>    -
> 
>    Ifs: We would like to support two grammars for if expressions:
>    - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>       - if(<cond>) { <then> } else { <else> } since it’s the C++ if grammar
>       and has better formatting for complex expressions.
>    -
> 
>    Booleans: We would like to support both && || and and or keywords the
>    same as C++.
>    -
> 
>    InExpressions: <eval> in (<member1>, <member2>, ...) . Its type is also
>    inferred.
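> 
>    To illustrate, here are a few hypothetical expressions that should be
>    valid under the proposed syntax (x, y, z are the schema fields from the
>    earlier example; the exact spellings are illustrative):
> 
>      y * 2.0f64 + castFloat8(x)          // suffixed and unsuffixed literals
>      if(z && x > 10u64, y, 0.0)          // functional if, boolean infix
>      if(x % 2u64 == 0u64) { y } else { -y }  // C++-style if
>      x in (1u64, 2u64, 3u64)             // InExpression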
> 
> The grammar can be roughly represented as:
> 
> exp := term
> term := literal | field | function | if | boolean | inexpression
> literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
>            FLOAT_LITERAL_SUFFIX | "-" literal
> field := IDENTIFIER
> function := infix_function | named_function
> infix_function := "-" term | term "+" term | term "-" term ...
> named_function := IDENTIFIER "(" arguments ")"
> arguments := term | arguments "," term
> if := "if" "(" term "," term "," term ")" |
>       "if" "(" term ")" "{" term "}" "else" "{" term "}"
> boolean := and_boolean | or_boolean
> and_boolean := term AND term | term "&&" term
> or_boolean := term OR term | term "||" term
> inexpression := term IN "(" arguments ")"
> 
> Lowercase names are non-terminals and uppercase names are tokens.
> Implementation
> 
> Lexing and Parsing
> 
> We would like to use Flex and Bison for lexing and parsing. They have
> several advantages over other options like ANTLR4 and Boost.Spirit:
> 
>    1. They are the most classical and popular parsing tools in the C++
>    world.
>    2. They allow us to precompile the parser code, leaving no runtime
>    dependencies.
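> 
> As a rough sketch (token names and actions below are illustrative, not
> the actual grammar files from the prototype), the Bison rules for the
> boolean productions could look like:
> 
>   %token IDENT AND_OP OR_OP
>   %%
>   term
>     : literal
>     | field
>     | function
>     | term AND_OP term  { $$ = gandiva::TreeExprBuilder::MakeAnd({$1, $3}); }
>     | term OR_OP term   { $$ = gandiva::TreeExprBuilder::MakeOr({$1, $3}); }
>     ;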
> 
> Flex & Bison take a string and output a gandiva::Node tree. The tree may
> contain untyped nodes, e.g., unsuffixed literals and functions.
> Type Inference
> 
> We’ll have a TypeInferenceVisitor class that inherits from the node
> visitor, implementing a 2-pass DFS algorithm to do type inference. In each
> pass, the visitor tries to infer the current node’s and its children’s
> types from the currently available info.
> 
> If it’s a non-leaf node such as function, if, or boolean:
> 
>    1. First visit each child: let them infer their own types as much as
>    they can.
>    2. Create a SignaturePattern based on the currently known types. The
>    pattern includes the current node’s type and its children’s types. A
>    type can be nullptr, meaning it is currently unknown. For example, the
>    SignaturePattern for func(x: u64, 5: untyped) is (u64, nullptr) ->
>    nullptr.
>    3. Get all available signatures of the current node. For functions,
>    these are the signatures registered in the function registry. For if,
>    it’s (bool, <type>, <type>) -> <type> for any type, etc.
>    4. Try to match each signature against the current SignaturePattern:
>       1. If none matches, it’s a type error.
>       2. If exactly one matches, that is the signature. We then update the
>       current node’s and its children’s types based on it.
>       3. If more than one matches, we extract a common pattern from the
>       set of matched signatures. For example, (nullptr, nullptr) -> bool
>       can be extracted from (double, double) -> bool and (int32, int32) ->
>       bool. We then update types based on this common pattern.
> 
> If it’s a leaf node, just update its value if its type is set by its
> parent. We need to do this because untyped literals’ values are saved in a
> generic container.
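> The matching and common-pattern steps above can be sketched in isolation.
> In this hedged sketch, types are plain strings with "" standing in for the
> proposal's nullptr (type currently unknown); Signature, Matches,
> CommonPattern, and MatchAll are illustrative names, not Gandiva APIs:

```cpp
// A standalone sketch of the signature-matching step described above.
// Types are modeled as strings, with "" standing in for the proposal's
// nullptr ("type currently unknown"). None of these names are Gandiva APIs.
#include <cassert>
#include <string>
#include <vector>

using Type = std::string;  // "" means unknown

struct Signature {
  std::vector<Type> params;
  Type ret;
};

// A pattern slot that is known must equal the signature's type;
// an unknown ("") slot matches anything.
bool Matches(const Signature& sig, const Signature& pattern) {
  if (sig.params.size() != pattern.params.size()) return false;
  for (size_t i = 0; i < sig.params.size(); ++i) {
    if (!pattern.params[i].empty() && pattern.params[i] != sig.params[i]) {
      return false;
    }
  }
  return pattern.ret.empty() || pattern.ret == sig.ret;
}

// Step 4.3: extract the common pattern of the matched signatures. A slot
// keeps its type only if every matched signature agrees on it.
Signature CommonPattern(const std::vector<Signature>& matched) {
  Signature common = matched.front();
  for (const Signature& sig : matched) {
    for (size_t i = 0; i < common.params.size(); ++i) {
      if (common.params[i] != sig.params[i]) common.params[i] = "";
    }
    if (common.ret != sig.ret) common.ret = "";
  }
  return common;
}

// Step 4: match every registered signature against the current pattern.
std::vector<Signature> MatchAll(const std::vector<Signature>& registry,
                                const Signature& pattern) {
  std::vector<Signature> matched;
  for (const Signature& sig : registry) {
    if (Matches(sig, pattern)) matched.push_back(sig);
  }
  return matched;
}
```

> In Gandiva itself the candidate signatures would come from the function
> registry mentioned above; this sketch only mirrors the matching logic.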
> 
> We run this procedure twice, with the second pass being slightly
> different: if a literal node is still untyped in the second pass, it is
> given a default type (int32 for ints and float32 for floats).
> 
>    1. The first pass is bottom-up propagation of types. A parent’s type is
>    inferred from its children.
>    2. The second pass is both:
>       1. top-down propagation of the first pass’s info. Once a parent’s
>       type is known from the first pass, it may set its children’s types.
>       2. bottom-up assignment of default literal types. A still-untyped
>       literal is given a default type.
> 
> In the second pass, since all leaf nodes’ types are known, the types of
> all nodes are guaranteed to be inferred.
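> 
> In pseudo-code, the two-pass driver could look like this (illustrative,
> not the prototype’s actual code; the flag name is made up):
> 
>   visitor.default_untyped_literals = false;
>   root->Accept(visitor);  // pass 1: bottom-up inference
>   visitor.default_untyped_literals = true;
>   root->Accept(visitor);  // pass 2: top-down propagation + literal defaults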
> Proof of correctness
> 
> I have a proof of correctness of this inference procedure in mind, but it’s
> too long to be written down here. (Fermat did it too 🙂) But the
> correctness is based on these two facts:
> 
>    1. There are no overloads that differ only in their return types. E.g.,
>    there are no func := (int32, int32) -> int32 and func := (int32,
>    int32) -> double in the Gandiva registry. Therefore we can always know a
>    function’s type if its children’s types are known.
>    2. There are no overloads that accept only non-default numeric types.
>    For example, if there were a func := (int16, int16) -> int16 and a
>    func := (int64, int64) -> int64, but no func := (int32, int32) -> int32,
>    then the inference procedure would fail on the expression func(1, 2),
>    because it cannot tell the types of 1 and 2, and giving them default
>    types won’t work.
> 
> Prototype and Examples
> 
> I’ve already written a mostly working prototype of the parser and type
> inference visitor and unit tests for them. They are on this branch:
> https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference
> 
> You can check out the diffs here:
> https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
> 
> The main files are:
> 
>    1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
>    2. cpp/src/gandiva/lex.ll: lex rules for Flex.
>    3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
>    4. cpp/src/gandiva/parser.cc: the driver class that combines the three
>    components.
>    5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the
>    proposed syntax and the resulting expression trees the parser generates.
>    You can run the tests with cmake .. --preset=ninja-debug-gandiva and
>    ninja test-gandiva-tests.
> 
> Any suggestion/question is appreciated!
> 
> 
> Best regards,
> 
> Jin Shang
