Hi Weston,

Thank you for your reply! I would like to address some of your concerns.
> On Sep 20, 2022, at 00:59, Weston Pace <weston.p...@gmail.com> wrote:
>
> First, I think you are correct that there is a lot of value to users here.
> I'd love for a capability like this to someday be in pyarrow too for Arrow
> compute functions.
>
> I think there is a distinct enough difference between "a query language"
> and "a programming language". However, both of them are supersets of "an
> expression language". Is there not already an existing syntax that can be
> used? I would recommend at least looking at Postgres' language for
> expression (value expressions)[1]. I don't think everything applies but I
> do think you might want to eventually add a few more of the things found
> there such as:

This proposal was intended to provide a textual representation of Gandiva
expressions, so the proposed syntax is simply a direct translation of Gandiva's
existing AST. So far, Gandiva's AST has five types of nodes:

1. Fields: placeholders for input.
2. Functions: as the name suggests.
3. Literals: as the name suggests.
4. Booleans: logical AND/OR relationships between a series of nodes.
5. In expressions: whether the result of an expression is in a set of literals.

> * nested / list array subscripts (e.g. x.y.z == 50 || my_list_array[0] ==
>   false)
> * positional / named parameters (e.g. my_foo == $1)

As we can see, the expressiveness of this AST is very limited. Unfortunately,
none of nested types, lists or variable assignments is supported by Gandiva.

> * parentheses to allow explicit precedence (e.g. (x + 5) * 20)

I implemented parentheses but forgot to add them to the specification. I will
attach a fixed version at the end of this email.

> As an advocate of Substrait I will at least suggest looking into it and
> possibly proposing a text format for Substrait expressions.

I have also been experimenting with Acero. I may propose a textual format for
Substrait expressions if my team and I find one helpful.

> Some specific feedback follows:
>
>> For functions with
>> multiple overloads, their return type is inferred from input types. For
>> commonly used functions, we would also like to support infix forms.
>
> Is there a reason you don't have logical and (&&), logical or (||) and
> logical negation (!)? Actually, I see you call these out separately later
> on (except you are maybe missing negation). Is there any reason for the
> distinction?

All three operators are supported, although in different ways. Gandiva's AST
singles out logical AND/OR as a separate type of node. I don't know the exact
reason, but my guess is that AND/OR can take a variable number of arguments and
Gandiva doesn't support variadic functions. Logical negation always takes
exactly one argument, so it is just a normal function, and I also implemented
the operator form (!) for it.
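To make the distinction concrete, here is a rough sketch (not part of the
proposal) of how an expression like (x > 5 && y < 10.0) || !z has to be built
by hand with the existing TreeExprBuilder API today. The registry function
names used here ("greater_than", "less_than", "not") are my best guess and may
differ slightly from the names actually registered:

#include <arrow/api.h>
#include "gandiva/tree_expr_builder.h"

gandiva::NodePtr BuildByHand() {
  auto x = gandiva::TreeExprBuilder::MakeField(arrow::field("x", arrow::int32()));
  auto y = gandiva::TreeExprBuilder::MakeField(arrow::field("y", arrow::float64()));
  auto z = gandiva::TreeExprBuilder::MakeField(arrow::field("z", arrow::boolean()));

  auto five = gandiva::TreeExprBuilder::MakeLiteral(int32_t(5));
  auto ten = gandiva::TreeExprBuilder::MakeLiteral(10.0);

  // Comparisons and negation are ordinary function nodes with an explicit
  // return type...
  auto x_gt_5 = gandiva::TreeExprBuilder::MakeFunction(
      "greater_than", {x, five}, arrow::boolean());
  auto y_lt_10 = gandiva::TreeExprBuilder::MakeFunction(
      "less_than", {y, ten}, arrow::boolean());
  auto not_z = gandiva::TreeExprBuilder::MakeFunction("not", {z}, arrow::boolean());

  // ...while AND/OR are dedicated boolean nodes that take a vector of children.
  auto both = gandiva::TreeExprBuilder::MakeAnd({x_gt_5, y_lt_10});
  return gandiva::TreeExprBuilder::MakeOr({both, not_z});
}

With the proposed parser, the same tree would come from a single string,
roughly "(x > 5 && y < 10.0) || !z".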
>> Ifs: We would like to support two grammars for if expressions:
>> - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>> - if(<cond>) { <then> } else { <else> } since it's the C++ if grammar
>>   and has better formatting for complex expressions.
>
> Given the fact we are creating functional expressions and not procedural
> code I would find the `if() {} else {}` syntax rather confusing. Do you
> have an example of one of these complex expressions?

I agree. I'll remove the if {} else {} version.

>> InExpressions: <eval> in (<member1>, <member2>, ...). Its type is also
>> inferred.
>
> What type? The type of `<eval>`? Why infer it? I agree it can be
> inferred but it seems like adding inference support (instead of just
> assuming default types and rejecting invalid queries) is adding more
> complexity for both parser and user (remember the rules of when I have to
> type literals). Are there other places where inference is needed?

The type of the whole expression "<expr> IN (...)". I made a mistake here: the
type of an InExpression is always boolean. Thanks for pointing this out!

>> The grammar can be roughly represented as:
>
> I'm not great at reading grammars but does your proposal support string
> literals?

I implemented string literals but forgot to put them in the specification. Here
is a fixed grammar (the changes are the STRING_LITERAL token and the
parenthesized term alternative):

exp := term
term := literal | field | function | if | boolean | inexpression | "(" term ")"
literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
           FLOAT_LITERAL_SUFFIX | STRING_LITERAL
field := IDENTIFIER
function := infix_function | named_function
infix_function := "-" term | term "+" term | term "-" term ...
named_function := IDENTIFIER(arguments)
arguments := term | arguments "," term
if := "if" "(" term "," term "," term ")" |
      "if" "(" term ")" "{" term "}" ELSE "{" term "}"
boolean := and_boolean | or_boolean
and_boolean := term AND term | term "&&" term
or_boolean := term OR term | term "||" term
inexpression := term IN "(" arguments ")"
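For illustration only, an input exercising both additions, a string literal and
explicit parentheses, could look like this with the proposed Parser API. The
schema fields and the starts_with function here are placeholders I picked for
the example, not part of the specification:

// Assumed schema: name: utf8, score: float64.
auto schema = arrow::schema({arrow::field("name", arrow::utf8()),
                             arrow::field("score", arrow::float64())});

// One expression combining a string literal and explicit parentheses.
std::string expr_str = "if(starts_with(name, 'user_'), (score + 5.0) * 2.0, score)";

auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto status = parser.Parse(expr_str, &root_node);

Both the quoted string and the parenthesized sub-expression reduce to ordinary
term nodes under the grammar above.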
Let me know if you have more feedback or questions!

Best regards,
Jin

> [1] https://www.postgresql.org/docs/current/sql-expressions.html
>
> On Sun, Sep 18, 2022 at 9:12 AM Antoine Pitrou <anto...@python.org> wrote:
>
>> Hello,
>>
>> I would add that Gandiva does not seem to have a lot of active
>> maintainers nowadays (as opposed to people who merely add functions for
>> their own use cases). If you would like to take such a responsibility
>> it would probably be very welcome.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 16/09/2022 at 17:57, Jin Shang wrote:
>>> Background
>>>
>>> My team uses an expression computation library for our C++ feature
>>> engineering pipeline. We currently use Exprtk
>>> <https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva and
>>> wrote some benchmarks. We discovered that Gandiva is several times faster
>>> than Exprtk in our use cases. Therefore we decided to switch to Gandiva for
>>> computing expressions.
>>>
>>> Objective
>>>
>>> Currently, due to its lack of a frontend, we need to manually construct
>>> an AST to use Gandiva. This is inconvenient and requires extra learning
>>> costs. We also want to enable our ML engineers to dynamically create/update
>>> an expression with runtime hot-loaded configs without restarting our
>>> server. This is currently impossible with Gandiva because the Expression
>>> tree is statically created with C++ and must be compiled into the server
>>> binary.
>>>
>>> Therefore, we would like to implement a parser frontend for Gandiva, so
>>> that Gandiva becomes a standalone, complete expression compiler and
>>> evaluator, and a drop-in replacement for existing libraries like Exprtk
>>> <https://github.com/ArashPartow/exprtk> and TinyExpr
>>> <https://github.com/codeplea/tinyexpr>. The goal is to enable the following
>>> functionality:
>>>
>>> // Create schema for gandiva
>>> auto field_x = arrow::field("x", arrow::uint64());
>>> auto field_y = arrow::field("y", arrow::float64());
>>> auto field_z = arrow::field("z", arrow::boolean());
>>> auto schema = arrow::schema({field_x, field_y, field_z});
>>>
>>> /** BEGIN CHANGE **/
>>> // Use the Parser to generate a NodePtr
>>> std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
>>> auto parser = gandiva::Parser(schema);
>>> gandiva::NodePtr root_node;
>>> auto status = parser.Parse(expr_str, &root_node);
>>> /** END CHANGE **/
>>>
>>> // The rest is normal usage of the Gandiva projector
>>> auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
>>> std::shared_ptr<gandiva::Projector> projector;
>>> auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
>>>
>>> auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
>>> auto* pool = arrow::default_memory_pool();
>>> arrow::ArrayVector outputs;
>>> auto status = projector->Evaluate(*in_batch, pool, &outputs);
>>>
>>> The code block enclosed by "BEGIN CHANGE" and "END CHANGE" is the proposed
>>> usage of the parser. It offers two benefits:
>>>
>>> 1. It's more intuitive to write math expressions compared to constructing
>>>    trees, thus easier to use.
>>> 2. It allows dynamically adding new expressions or changing existing ones
>>>    with a runtime hot-loaded config file without restarting our server.
>>>
>>> Syntax
>>>
>>> The goal is to design a succinct and intuitive grammar for both schema and
>>> Gandiva expressions. We will need a corresponding grammar for each Node
>>> type in Gandiva.
>>>
>>> - Literals: We find Rust's literal representation
>>>   (https://doc.rust-lang.org/rust-by-example/types/literals.html) very
>>>   intuitive. We'll support suffixes such as i32, u64, f32 to denote a
>>>   literal node's type. The types of unsuffixed literals are inferred by
>>>   their usage. Otherwise, the default type for integers is int32 and
>>>   float32 for floating points. String and binary literals are wrapped with
>>>   single or double quotes. Decimal128 literals will not be supported in
>>>   the first version.
>>>
>>> - Fields: Just their names as defined in the schema. To avoid conflicts
>>>   with other node types, field names must start with alphabetical letters.
>>>
>>> - Functions: <function_name>(<param1>, <param2>, ...). For functions with
>>>   multiple overloads, their return type is inferred from input types. For
>>>   commonly used functions, we would also like to support infix forms. They
>>>   include:
>>>   - Comparisons: equal(==), not equal(!=), greater than(>), greater than
>>>     or equal to(>=), less than(<), less than or equal to(<=)
>>>   - Arithmetics: add(+), subtract(-), multiply(*), divide(/), modulo(%),
>>>     power(^), bitwise and(&), bitwise or(|), bitwise xor(^), bitwise not(~)
>>>
>>>   Function aliases with spaces in their names, such as "is not false",
>>>   won't be supported.
>>>
>>> - Ifs: We would like to support two grammars for if expressions:
>>>   - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>>>   - if(<cond>) { <then> } else { <else> } since it's the C++ if grammar
>>>     and has better formatting for complex expressions.
>>>
>>> - Booleans: We would like to support both && || and the and/or keywords,
>>>   the same as C++.
>>> - InExpressions: <eval> in (<member1>, <member2>, ...). Its type is also
>>>   inferred.
>>>
>>> The grammar can be roughly represented as:
>>>
>>> exp := term
>>> term := literal | field | function | if | boolean | inexpression
>>> literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
>>>            FLOAT_LITERAL_SUFFIX | "-" literal
>>> field := IDENTIFIER
>>> function := infix_function | named_function
>>> infix_function := "-" term | term "+" term | term "-" term ...
>>> named_function := IDENTIFIER(arguments)
>>> arguments := term | argument "," term
>>> if := "if" "(" term "," term "," term ")" |
>>>       "if" "(" term ")" "{" term "}" ELSE "{" term "}"
>>> boolean := and_boolean | or_boolean
>>> and_boolean := term AND term | term "&&" term
>>> or_boolean := term OR term | term "||" term
>>> inexpression := term IN "(" arguments ")"
>>>
>>> Lower cases are non-terminals and upper cases are tokens.
>>>
>>> Implementation
>>>
>>> Lexing and Parsing
>>>
>>> We would like to use flex and bison for lexing and parsing. They have
>>> several advantages compared to other options like Antlr4 and Boost::Spirit.
>>>
>>> 1. They are the most classical and popular parsing libraries in the C++
>>>    world.
>>> 2. They allow us to precompile the parser code and have no runtime
>>>    dependencies.
>>>
>>> Flex and bison take a string and output a gandiva::Node tree. The tree may
>>> contain untyped nodes, e.g., unsuffixed literals and functions.
>>>
>>> Type Inference
>>>
>>> We'll have a TypeInferenceVisitor class that inherits from the node
>>> visitor, implementing a 2-pass DFS algorithm to do type inference. In each
>>> pass, the visitor tries to infer the current node's and its children's
>>> types from the currently available info.
>>>
>>> If it's a non-leaf node such as function, if or boolean:
>>>
>>> 1. First visit each child: let them infer their own types as much as they
>>>    can.
>>> 2. Create a SignaturePattern based on currently known types. The pattern
>>>    includes the current node's type and children's types. The types can be
>>>    nullptr, meaning the type is currently unknown. For example, the
>>>    SignaturePattern for func(x: u64, 5: untyped) will be
>>>    (u64, nullptr) -> nullptr.
>>> 3. Get all available signatures of the current node. For functions, it's
>>>    the signatures registered in the function registry. For if, it's
>>>    (bool, <type>, <type>) -> <type> for any type, etc.
>>> 4. Try to match each signature against the current SignaturePattern:
>>>    1. If none matches, it's a type error.
>>>    2. If exactly one matches, that is the matched signature. We then
>>>       update the current node's and its children's types based on it.
>>>    3. If more than one matches, we extract a common pattern from the set
>>>       of matched signatures. For example, (nullptr, nullptr) -> bool can
>>>       be extracted from (double, double) -> bool and
>>>       (int32, int32) -> bool. We then update types based on this common
>>>       pattern.
>>>
>>> If it's a leaf node, just update its value if its type is set by its
>>> parent. We need to do this because untyped literals' values are saved in a
>>> generic container.
>>>
>>> We run this procedure two times, with the second pass being a bit
>>> different. In the second pass, if a literal node is still untyped, we give
>>> it a default type (int32 for ints and float32 for floats).
>>>
>>> 1. The first pass is bottom-up propagation of types. A parent's type is
>>>    inferred from its children.
>>> 2. The second pass is both:
>>>    1. Top-down propagation of the first pass' info. Once a parent's type
>>>       is known in the first pass, it may set its children's types.
>>>    2. Bottom-up propagation of default literal types. A literal is given a
>>>       default type.
>>>
>>> In the second pass, all leaf nodes' types are known, so the types of all
>>> nodes are guaranteed to be inferred.
>>>
>>> Proof of correctness
>>>
>>> I have a proof of correctness of this inference procedure in mind, but
>>> it's too long to be written down here. (Fermat did it too 🙂) But the
>>> correctness is based on these two facts:
>>>
>>> 1. There are no overloads that differ only in their return types. E.g.
>>>    there are no func := (int32, int32) -> int32 and
>>>    func := (int32, int32) -> double in the Gandiva registry. Therefore we
>>>    can always know a function's type if its children's types are known.
>>> 2. There are no overloads that accept only non-default numeric types. For
>>>    example, if there were a func := (int16, int16) -> int16 and a
>>>    func := (int64, int64) -> int64, but no func := (int32, int32) -> int32,
>>>    then the inference procedure would fail on the expression func(1, 2)
>>>    because it cannot tell the types of 1 and 2, and giving them default
>>>    types won't work.
>>>
>>> Prototype and Examples
>>>
>>> I've already written a mostly working prototype of the parser and type
>>> inference visitor, along with unit tests for them. They are on this branch:
>>> https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference
>>>
>>> You can check out the diffs here:
>>> https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
>>>
>>> The main files are:
>>>
>>> 1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
>>> 2. cpp/src/gandiva/lex.ll: lex rules for Flex.
>>> 3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
>>> 4. cpp/src/gandiva/parser.cc: the driver class that combines the three
>>>    components.
>>> 5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the
>>>    proposed syntax and the resulting expression trees the parser generates.
>>>    You can run the tests by running cmake .. --preset=ninja-debug-gandiva
>>>    and ninja test-gandiva-tests.
>>>
>>> Any suggestion/question is appreciated!
>>>
>>> Best regards,
>>>
>>> Jin Shang