Hi,

FYI:
This proposal was filed at https://issues.apache.org/jira/browse/ARROW-17668 .

We have a similar proposal for Acero:
https://issues.apache.org/jira/browse/ARROW-17351

There is a discussion about the proposal for Acero on ursalabs.zulipchat.com:
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Textual.20inputs.20for.20Acero

If you have an opinion on this proposal, please share it.

Thanks,
--
kou

In <CA+TczXx7=5z3VGsp0SSsUfKzkXk62MeCq1z8yFQ=m60txsu...@mail.gmail.com>
  "[C++][Gandiva] Proposal to Add A Parser Frontend for Gandiva" on Fri, 16 Sep 2022 23:57:34 +0800,
  Jin Shang <shangjin1...@gmail.com> wrote:

> Background
>
> My team uses an expression computation library for our C++ feature
> engineering pipeline. We currently use Exprtk
> <https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva and
> wrote some benchmarks. We found that Gandiva is several times faster than
> Exprtk in our use cases, so we decided to switch to Gandiva for computing
> expressions.
>
> Objective
>
> Currently, because Gandiva lacks a frontend, we need to manually construct
> an AST to use it. This is inconvenient and adds an extra learning cost. We
> also want to enable our ML engineers to dynamically create or update an
> expression from runtime hot-loaded configs without restarting our server.
> This is currently impossible with Gandiva because the expression tree is
> constructed statically in C++ and must be compiled into the server binary.
>
> Therefore, we would like to implement a parser frontend for Gandiva, so
> that Gandiva becomes a standalone, complete expression compiler and
> evaluator, and a drop-in replacement for existing libraries like Exprtk
> <https://github.com/ArashPartow/exprtk> and TinyExpr
> <https://github.com/codeplea/tinyexpr>. The goal is to enable the following
> functionality:
>
> // Create the schema for Gandiva
> auto field_x = arrow::field("x", arrow::uint64());
> auto field_y = arrow::field("y", arrow::float64());
> auto field_z = arrow::field("z", arrow::boolean());
> auto schema = arrow::schema({field_x, field_y, field_z});
>
> /** BEGIN CHANGE **/
> // Use the Parser to generate a NodePtr
> std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
> auto parser = gandiva::Parser(schema);
> gandiva::NodePtr root_node;
> auto status = parser.Parse(expr_str, &root_node);
> /** END CHANGE **/
>
> // The rest is normal usage of a Gandiva projector
> auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
> std::shared_ptr<gandiva::Projector> projector;
> status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
>
> auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
> auto* pool = arrow::default_memory_pool();
> arrow::ArrayVector outputs;
> status = projector->Evaluate(*in_batch, pool, &outputs);
>
> The code block enclosed by "BEGIN CHANGE" and "END CHANGE" is the proposed
> usage of the parser. It offers two benefits:
>
> 1. It is more intuitive to write math expressions than to construct trees,
>    and therefore easier to use.
> 2. It allows dynamically adding new expressions and changing existing ones
>    through a runtime hot-loaded config file without restarting our server.
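>
> As a rough sketch of benefit 2 (not part of the prototype), expressions
> could be read as plain strings from a hot-loaded config and compiled at
> runtime. LoadExprStringsFromConfig below is a hypothetical helper; only
> Parser::Parse (proposed above) and the existing Projector/Expression APIs
> are assumed:
>
> // Sketch only: rebuild projectors from config strings at runtime,
> // reusing `schema` from the snippet above.
> std::vector<std::string> expr_strs = LoadExprStringsFromConfig("exprs.conf");  // hypothetical helper
> auto parser = gandiva::Parser(schema);   // proposed API
> gandiva::ExpressionVector exprs;
> for (const auto& s : expr_strs) {
>   gandiva::NodePtr node;
>   auto status = parser.Parse(s, &node);
>   if (!status.ok()) continue;  // skip invalid config entries
>   exprs.push_back(std::make_shared<gandiva::Expression>(
>       node, arrow::field("out", node->return_type())));
> }
> std::shared_ptr<gandiva::Projector> projector;
> auto make_status = gandiva::Projector::Make(schema, exprs, &projector);  // rebuild whenever the config changes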
>
> Syntax
>
> The goal is to design a succinct and intuitive grammar for both the schema
> and Gandiva expressions. We need a corresponding grammar rule for each Node
> type in Gandiva.
>
> - Literals: We find Rust's literal representation
>   (https://doc.rust-lang.org/rust-by-example/types/literals.html) very
>   intuitive. We'll support suffixes such as i32, u64 and f32 to denote a
>   literal node's type. The types of unsuffixed literals are inferred from
>   their usage; otherwise, the default type is int32 for integers and
>   float32 for floating points. String and binary literals are wrapped in
>   single or double quotes. Decimal128 literals will not be supported in the
>   first version.
> - Fields: Just their names as defined in the schema. To avoid conflicts
>   with other node types, field names must start with an alphabetical letter.
> - Functions: <function_name>(<param1>, <param2>, ...). For functions with
>   multiple overloads, the return type is inferred from the input types. For
>   commonly used functions, we would also like to support infix forms,
>   including:
>   - Comparisons: equal (==), not equal (!=), greater than (>), greater than
>     or equal to (>=), less than (<), less than or equal to (<=)
>   - Arithmetic: add (+), subtract (-), multiply (*), divide (/), modulo (%),
>     power (^), bitwise and (&), bitwise or (|), bitwise xor (^), bitwise not (~)
>   Function aliases with spaces in their names, such as "is not false", won't
>   be supported.
> - Ifs: We would like to support two grammars for if expressions:
>   - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>   - if(<cond>) { <then> } else { <else> }, since it is the C++ if grammar
>     and formats better for complex expressions.
> - Booleans: We would like to support both the &&/|| operators and the
>   and/or keywords, the same as C++.
> - InExpressions: <eval> in (<member1>, <member2>, ...). Its type is also
>   inferred.
>
> The grammar can be roughly represented as:
>
> exp := term
> term := literal | field | function | if | boolean | inexpression
> literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
>            FLOAT_LITERAL_SUFFIX | "-" literal
> field := IDENTIFIER
> function := infix_function | named_function
> infix_function := "-" term | term "+" term | term "-" term ...
> named_function := IDENTIFIER "(" arguments ")"
> arguments := term | arguments "," term
> if := "if" "(" term "," term "," term ")" |
>       "if" "(" term ")" "{" term "}" ELSE "{" term "}"
> boolean := and_boolean | or_boolean
> and_boolean := term AND term | term "&&" term
> or_boolean := term OR term | term "||" term
> inexpression := term IN "(" arguments ")"
>
> Lower case names are non-terminals and upper case names are tokens.
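>
> As a rough illustration (a sketch, not taken from the prototype's tests),
> these are the kinds of strings the grammar above is meant to accept,
> parsed against the schema from the Objective section (x: uint64,
> y: float64, z: boolean):
>
> auto parser = gandiva::Parser(schema);
> gandiva::NodePtr node;
> // suffixed literals, infix arithmetic and comparison
> auto s1 = parser.Parse("x + 1u64 > 100u64", &node);
> // functional if, named function call, unsuffixed literal inferred as float64
> auto s2 = parser.Parse("if(z, castFloat8(x), y * 2.5)", &node);
> // C++-style if, && operator
> auto s3 = parser.Parse("if (z && y < 10.0) { y } else { 0.0 }", &node);
> // InExpression combined with the `or` keyword
> auto s4 = parser.Parse("x in (1u64, 2u64, 3u64) or z", &node);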
>
> Implementation
>
> Lexing and Parsing
>
> We would like to use Flex and Bison for lexing and parsing. They have
> several advantages over other options like ANTLR4 and Boost::Spirit:
>
> 1. They are the most classic and widely used parsing tools in the C++ world.
> 2. They allow us to precompile the parser code and have no runtime
>    dependencies.
>
> Flex and Bison take a string and output a tree of gandiva::Node objects.
> The tree may contain untyped nodes, e.g. unsuffixed literals and functions.
>
> Type Inference
>
> We'll have a TypeInferenceVisitor class that inherits from the node visitor
> and implements a 2-pass DFS algorithm to do type inference. In each pass,
> the visitor tries to infer the current node's and its children's types from
> the currently available information.
>
> If it's a non-leaf node, such as a function, if or boolean:
>
> 1. First visit each child: let them infer their own types as much as they
>    can.
> 2. Create a SignaturePattern based on the currently known types. The pattern
>    includes the current node's type and its children's types. A type can be
>    nullptr, meaning it is currently unknown. For example, the
>    SignaturePattern for func(x: u64, 5: untyped) will be (u64, nullptr)->nullptr.
> 3. Get all available signatures of the current node. For functions, these
>    are the signatures registered in the function registry. For if, it's
>    (bool, <type>, <type>)-><type> for any type, etc.
> 4. Try to match each signature against the current SignaturePattern:
>    1. If none matches, it's a type error.
>    2. If exactly one matches, it is the matched signature. We then update
>       the current node's and its children's types based on this signature.
>    3. If more than one matches, we extract a common pattern from the set of
>       matched signatures. For example, (nullptr, nullptr)->bool can be
>       extracted from (double, double)->bool and (int32, int32)->bool. We
>       then update types based on this common pattern.
>
> If it's a leaf node, just update its value if its type has been set by its
> parent. We need to do this because untyped literals' values are stored in a
> generic container.
>
> We run this procedure twice, with the second pass slightly different: in the
> second pass, if a literal node is still untyped, it is given a default type
> (int32 for integers, float32 for floats).
>
> 1. The first pass is bottom-up propagation of types: a parent's type is
>    inferred from its children.
> 2. The second pass is both:
>    1. top-down propagation of the first pass's information: once a parent's
>       type is known from the first pass, it may set its children's types;
>    2. bottom-up propagation of default literal types: a literal is given a
>       default type.
>
> In the second pass all leaf nodes' types are known, so the types of all
> nodes are guaranteed to be inferred.
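>
> The following is only a sketch of the common-pattern extraction in step 4.3,
> not the prototype's actual code; the layout of SignaturePattern and the
> helper name ExtractCommonPattern are illustrative:
>
> #include <arrow/type.h>
> #include <memory>
> #include <vector>
>
> // A pattern is the return type plus the parameter types; nullptr marks an
> // unknown type (illustrative layout, not the prototype's definition).
> struct SignaturePattern {
>   std::shared_ptr<arrow::DataType> ret;                  // nullptr if unknown
>   std::vector<std::shared_ptr<arrow::DataType>> params;  // nullptr entries are unknown
> };
>
> // Keep a type only if every matched signature agrees on it, otherwise mark
> // it unknown. E.g. (double, double)->bool and (int32, int32)->bool give
> // (nullptr, nullptr)->bool.
> SignaturePattern ExtractCommonPattern(const std::vector<SignaturePattern>& matched) {
>   SignaturePattern common = matched.front();
>   for (size_t i = 1; i < matched.size(); ++i) {
>     if (common.ret && !common.ret->Equals(matched[i].ret)) common.ret = nullptr;
>     for (size_t j = 0; j < common.params.size(); ++j) {
>       if (common.params[j] && !common.params[j]->Equals(matched[i].params[j])) {
>         common.params[j] = nullptr;
>       }
>     }
>   }
>   return common;
> }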
>
> Proof of correctness
>
> I have a proof of correctness for this inference procedure in mind, but it's
> too long to write down here. (Fermat did it too 🙂) The correctness rests on
> these two facts:
>
> 1. There are no overloads that differ only in their return types. E.g. there
>    are no func := (int32, int32)->int32 and func := (int32, int32)->double
>    in the Gandiva registry. Therefore we can always determine a function's
>    type once its children's types are known.
> 2. There are no overloads that accept only non-default numeric types. For
>    example, if there were a func := (int16, int16)->int16 and a
>    func := (int64, int64)->int64, but no func := (int32, int32)->int32, the
>    inference procedure would fail on the expression func(1, 2) because it
>    cannot tell the types of 1 and 2, and giving them default types won't work.
>
> Prototype and Examples
>
> I've already written a mostly working prototype of the parser and the type
> inference visitor, along with unit tests for them. They are on this branch:
> https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference
>
> You can check out the diffs here:
> https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
>
> The main files are:
>
> 1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
> 2. cpp/src/gandiva/lex.ll: lex rules for Flex.
> 3. cpp/src/gandiva/typeinference.h/cc: the type inference procedure.
> 4. cpp/src/gandiva/parser.cc: the driver class that combines the three
>    components.
> 5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the
>    proposed syntax and the resulting expression trees the parser generates.
>    You can run the tests with cmake .. --preset=ninja-debug-gandiva and
>    ninja test-gandiva-tests.
>
> Any suggestion/question is appreciated!
>
> Best regards,
>
> Jin Shang