Hi Weston,

Thank you for your reply! I would like to address some of your concerns.
> On Sep 20, 2022, at 00:59, Weston Pace <weston.p...@gmail.com> wrote:
>
> First, I think you are correct that there is a lot of value to users here.
> I'd love for a capability like this to someday be in pyarrow too for Arrow
> compute functions.
>
> I think there is a distinct enough difference between "a query language"
> and "a programming language". However, both of them are supersets of "an
> expression language". Is there not already an existing syntax that can be
> used? I would recommend at least looking at Postgres' language for
> expression (value expressions)[1]. I don't think everything applies but I
> do think you might want to eventually add a few more of the things found
> there such as:

This proposal was intended to provide a textual representation of Gandiva
expressions, so the proposed syntax is simply a direct translation of Gandiva's
existing AST. So far, Gandiva's AST has five types of nodes:

1. Fields: placeholders for input.
2. Functions: as the name suggests.
3. Literals: as the name suggests.
4. Booleans: logical AND/OR relationships between a series of nodes.
5. In expressions: whether the result of an expression is in a set of literals.

> * nested / list array subscripts (e.g. x.y.z == 50 || my_list_array[0] ==
>   false)
> * positional / named parameters (e.g. my_foo == $1)

As we can see, the expressiveness of this AST is very limited. Unfortunately,
none of nested types, lists or variable assignments is supported by Gandiva.

> * parentheses to allow explicit precedence (e.g. (x + 5) * 20)

I implemented parentheses but forgot to add them to the specification. I will
attach a fixed version at the end of this email.

> As an advocate of Substrait I will at least suggest looking into it and
> possibly proposing a text format for Substrait expressions.

I have also been experimenting with Acero. I may propose a textual format for
Substrait expressions if my team and I find one helpful.

> Some specific feedback follows:
>
>> For functions with
>> multiple overloads, their return type is inferred from input types. For
>> commonly used functions, we would also like to support infix forms.
>
> Is there a reason you don't have logical and (&&), logical or (||) and
> logical negation (!)? Actually, I see you call these out separately later
> on (except you are maybe missing negation). Is there any reason for the
> distinction?

All three operators are supported, although in different ways. Gandiva's AST
singles out logical AND/OR as a separate type of node. I don't know the exact
reason, but my guess is that AND/OR can take a variable number of arguments and
Gandiva doesn't support variadic functions. Logical negation always takes
exactly one argument, so it is just a normal function, and I also implemented
the operator form (!) for it.
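To make the distinction concrete, here is a rough sketch (not part of the
proposal) of how an expression like (x > 5 && y < 10.0) || !z has to be built
by hand with the existing TreeExprBuilder API today. The registry function
names used here ("greater_than", "less_than", "not") are my best guess and may
differ slightly from the names actually registered:

#include <arrow/api.h>
#include "gandiva/tree_expr_builder.h"

gandiva::NodePtr BuildByHand() {
  auto x = gandiva::TreeExprBuilder::MakeField(arrow::field("x", arrow::int32()));
  auto y = gandiva::TreeExprBuilder::MakeField(arrow::field("y", arrow::float64()));
  auto z = gandiva::TreeExprBuilder::MakeField(arrow::field("z", arrow::boolean()));

  auto five = gandiva::TreeExprBuilder::MakeLiteral(int32_t(5));
  auto ten = gandiva::TreeExprBuilder::MakeLiteral(10.0);

  // Comparisons and negation are ordinary function nodes with an explicit
  // return type...
  auto x_gt_5 = gandiva::TreeExprBuilder::MakeFunction(
      "greater_than", {x, five}, arrow::boolean());
  auto y_lt_10 = gandiva::TreeExprBuilder::MakeFunction(
      "less_than", {y, ten}, arrow::boolean());
  auto not_z = gandiva::TreeExprBuilder::MakeFunction("not", {z}, arrow::boolean());

  // ...while AND/OR are dedicated boolean nodes that take a vector of children.
  auto both = gandiva::TreeExprBuilder::MakeAnd({x_gt_5, y_lt_10});
  return gandiva::TreeExprBuilder::MakeOr({both, not_z});
}

With the proposed parser, the same tree would come from a single string,
roughly "(x > 5 && y < 10.0) || !z".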
>> Ifs: We would like to support two grammars for if expressions:
>> - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>> - if(<cond>) { <then> } else { <else> } since it's the C++ if grammar
>>   and has better formatting for complex expressions.
>
> Given the fact we are creating functional expressions and not procedural
> code I would find the `if() {} else {}` syntax rather confusing. Do you
> have an example of one of these complex expressions?

I agree. I'll remove the if {} else {} version.

>> InExpressions: <eval> in (<member1>, <member2>, ...). Its type is also
>> inferred.
>
> What type? The type of `<eval>`? Why infer it? I agree it can be
> inferred but it seems like adding inference support (instead of just
> assuming default types and rejecting invalid queries) is adding more
> complexity for both parser and user (remember the rules of when I have to
> type literals). Are there other places where inference is needed?

The type of the whole expression "<expr> IN (...)". I made a mistake here: the
type of an InExpression is always boolean. Thanks for pointing this out!

>> The grammar can be roughly represented as:
>
> I'm not great at reading grammars but does your proposal support string
> literals?

I implemented string literals but forgot to put them in the specification. Here
is a fixed grammar (the changes are the STRING_LITERAL token and the
parenthesized term alternative):

exp := term
term := literal | field | function | if | boolean | inexpression | "(" term ")"
literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
           FLOAT_LITERAL_SUFFIX | STRING_LITERAL
field := IDENTIFIER
function := infix_function | named_function
infix_function := "-" term | term "+" term | term "-" term ...
named_function := IDENTIFIER(arguments)
arguments := term | arguments "," term
if := "if" "(" term "," term "," term ")" |
      "if" "(" term ")" "{" term "}" ELSE "{" term "}"
boolean := and_boolean | or_boolean
and_boolean := term AND term | term "&&" term
or_boolean := term OR term | term "||" term
inexpression := term IN "(" arguments ")"
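For illustration only, an input exercising both additions, a string literal and
explicit parentheses, could look like this with the proposed Parser API. The
schema fields and the starts_with function here are placeholders I picked for
the example, not part of the specification:

// Assumed schema: name: utf8, score: float64.
auto schema = arrow::schema({arrow::field("name", arrow::utf8()),
                             arrow::field("score", arrow::float64())});

// One expression combining a string literal and explicit parentheses.
std::string expr_str = "if(starts_with(name, 'user_'), (score + 5.0) * 2.0, score)";

auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto status = parser.Parse(expr_str, &root_node);

Both the quoted string and the parenthesized sub-expression reduce to ordinary
term nodes under the grammar above.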
Let me know if you have more feedback or questions!

Best regards,
Jin

> [1] https://www.postgresql.org/docs/current/sql-expressions.html
>
> On Sun, Sep 18, 2022 at 9:12 AM Antoine Pitrou <anto...@python.org> wrote:
>
>> Hello,
>>
>> I would add that Gandiva does not seem to have a lot of active
>> maintainers nowadays (as opposed to people who merely add functions for
>> their own use cases). If you would like to take such a responsibility
>> it would probably be very welcome.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 16/09/2022 at 17:57, Jin Shang wrote:
>>> Background
>>>
>>> My team uses an expression computation library for our C++ feature
>>> engineering pipeline. We currently use Exprtk
>>> <https://github.com/ArashPartow/exprtk>. We recently tried out Gandiva and
>>> wrote some benchmarks. We discovered that Gandiva is several times faster
>>> than Exprtk in our use cases. Therefore we decided to switch to Gandiva for
>>> computing expressions.
>>>
>>> Objective
>>>
>>> Currently, due to its lack of a frontend, we need to manually construct
>>> an AST to use Gandiva. This is inconvenient and requires extra learning
>>> costs. We also want to enable our ML engineers to dynamically create/update
>>> an expression with runtime hot-loaded configs without restarting our
>>> server. This is currently impossible with Gandiva because the Expression
>>> tree is statically created with C++ and must be compiled into the server
>>> binary.
>>>
>>> Therefore, we would like to implement a parser frontend for Gandiva, so
>>> that Gandiva becomes a standalone, complete expression compiler and
>>> evaluator, and a drop-in replacement for existing libraries like Exprtk
>>> <https://github.com/ArashPartow/exprtk> and TinyExpr
>>> <https://github.com/codeplea/tinyexpr>. The goal is to enable the following
>>> functionality:
>>>
>>> // Create schema for gandiva
>>> auto field_x = arrow::field("x", arrow::uint64());
>>> auto field_y = arrow::field("y", arrow::float64());
>>> auto field_z = arrow::field("z", arrow::boolean());
>>> auto schema = arrow::schema({field_x, field_y, field_z});
>>>
>>> /** BEGIN CHANGE **/
>>> // Use the Parser to generate a NodePtr
>>> std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
>>> auto parser = gandiva::Parser(schema);
>>> gandiva::NodePtr root_node;
>>> auto status = parser.Parse(expr_str, &root_node);
>>> /** END CHANGE **/
>>>
>>> // The rest is normal usage of the Gandiva projector
>>> auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
>>> std::shared_ptr<gandiva::Projector> projector;
>>> auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
>>>
>>> auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
>>> auto* pool = arrow::default_memory_pool();
>>> arrow::ArrayVector outputs;
>>> auto status = projector->Evaluate(*in_batch, pool, &outputs);
>>>
>>> The code block enclosed by "BEGIN CHANGE" and "END CHANGE" is the proposed
>>> usage of the parser. It offers two benefits:
>>>
>>> 1. It's more intuitive to write math expressions compared to constructing
>>>    trees, thus easier to use.
>>> 2. It allows dynamically adding new expressions or changing existing ones
>>>    with a runtime hot-loaded config file without restarting our server.
>>>
>>> Syntax
>>>
>>> The goal is to design a succinct and intuitive grammar for both schema and
>>> Gandiva expressions. We will need a corresponding grammar for each Node
>>> type in Gandiva.
>>>
>>> - Literals: We find Rust's literal representation
>>>   (https://doc.rust-lang.org/rust-by-example/types/literals.html) very
>>>   intuitive. We'll support suffixes such as i32, u64, f32 to denote a
>>>   literal node's type. The types of unsuffixed literals are inferred by
>>>   their usage. Otherwise, the default type for integers is int32 and
>>>   float32 for floating points. String and binary literals are wrapped with
>>>   single or double quotes. Decimal128 literals will not be supported in
>>>   the first version.
>>>
>>> - Fields: Just their names as defined in the schema. To avoid conflicts
>>>   with other node types, field names must start with alphabetical letters.
>>>
>>> - Functions: <function_name>(<param1>, <param2>, ...). For functions with
>>>   multiple overloads, their return type is inferred from input types. For
>>>   commonly used functions, we would also like to support infix forms. They
>>>   include:
>>>   - Comparisons: equal(==), not equal(!=), greater than(>), greater than
>>>     or equal to(>=), less than(<), less than or equal to(<=)
>>>   - Arithmetics: add(+), subtract(-), multiply(*), divide(/), modulo(%),
>>>     power(^), bitwise and(&), bitwise or(|), bitwise xor(^), bitwise not(~)
>>>
>>>   Function aliases with spaces in their names, such as "is not false",
>>>   won't be supported.
>>>
>>> - Ifs: We would like to support two grammars for if expressions:
>>>   - if(<cond>, <then>, <else>) for its simplicity and functional feel;
>>>   - if(<cond>) { <then> } else { <else> } since it's the C++ if grammar
>>>     and has better formatting for complex expressions.
>>>
>>> - Booleans: We would like to support both && || and the and/or keywords,
>>>   the same as C++.
>>> - InExpressions: <eval> in (<member1>, <member2>, ...). Its type is also
>>>   inferred.
>>>
>>> The grammar can be roughly represented as:
>>>
>>> exp := term
>>> term := literal | field | function | if | boolean | inexpression
>>> literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL |
>>>            FLOAT_LITERAL_SUFFIX | "-" literal
>>> field := IDENTIFIER
>>> function := infix_function | named_function
>>> infix_function := "-" term | term "+" term | term "-" term ...
>>> named_function := IDENTIFIER(arguments)
>>> arguments := term | argument "," term
>>> if := "if" "(" term "," term "," term ")" |
>>>       "if" "(" term ")" "{" term "}" ELSE "{" term "}"
>>> boolean := and_boolean | or_boolean
>>> and_boolean := term AND term | term "&&" term
>>> or_boolean := term OR term | term "||" term
>>> inexpression := term IN "(" arguments ")"
>>>
>>> Lower cases are non-terminals and upper cases are tokens.
>>>
>>> Implementation
>>>
>>> Lexing and Parsing
>>>
>>> We would like to use flex and bison for lexing and parsing. They have
>>> several advantages compared to other options like Antlr4 and Boost::Spirit.
>>>
>>> 1. They are the most classical and popular parsing libraries in the C++
>>>    world.
>>> 2. They allow us to precompile the parser code and have no runtime
>>>    dependencies.
>>>
>>> Flex and bison take a string and output a gandiva::Node tree. The tree may
>>> contain untyped nodes, e.g., unsuffixed literals and functions.
>>>
>>> Type Inference
>>>
>>> We'll have a TypeInferenceVisitor class that inherits from the node
>>> visitor, implementing a 2-pass DFS algorithm to do type inference. In each
>>> pass, the visitor tries to infer the current node's and its children's
>>> types from the currently available info.
>>>
>>> If it's a non-leaf node such as function, if or boolean:
>>>
>>> 1. First visit each child: let them infer their own types as much as they
>>>    can.
>>> 2. Create a SignaturePattern based on currently known types. The pattern
>>>    includes the current node's type and children's types. The types can be
>>>    nullptr, meaning the type is currently unknown. For example, the
>>>    SignaturePattern for func(x: u64, 5: untyped) will be
>>>    (u64, nullptr) -> nullptr.
>>> 3. Get all available signatures of the current node. For functions, it's
>>>    the signatures registered in the function registry. For if, it's
>>>    (bool, <type>, <type>) -> <type> for any type, etc.
>>> 4. Try to match each signature against the current SignaturePattern:
>>>    1. If none matches, it's a type error.
>>>    2. If exactly one matches, that is the matched signature. We then
>>>       update the current node's and its children's types based on it.
>>>    3. If more than one matches, we extract a common pattern from the set
>>>       of matched signatures. For example, (nullptr, nullptr) -> bool can
>>>       be extracted from (double, double) -> bool and
>>>       (int32, int32) -> bool. We then update types based on this common
>>>       pattern.
>>>
>>> If it's a leaf node, just update its value if its type is set by its
>>> parent. We need to do this because untyped literals' values are saved in a
>>> generic container.
>>>
>>> We run this procedure two times, with the second pass being a bit
>>> different. In the second pass, if a literal node is still untyped, we give
>>> it a default type (int32 for ints and float32 for floats).
>>>
>>> 1. The first pass is bottom-up propagation of types. A parent's type is
>>>    inferred from its children.
>>> 2. The second pass is both:
>>>    1. Top-down propagation of the first pass' info. Once a parent's type
>>>       is known in the first pass, it may set its children's types.
>>>    2. Bottom-up propagation of default literal types. A literal is given a
>>>       default type.
>>>
>>> In the second pass, all leaf nodes' types are known, so the types of all
>>> nodes are guaranteed to be inferred.
>>>
>>> Proof of correctness
>>>
>>> I have a proof of correctness of this inference procedure in mind, but
>>> it's too long to be written down here. (Fermat did it too 🙂) But the
>>> correctness is based on these two facts:
>>>
>>> 1. There are no overloads that differ only in their return types. E.g.
>>>    there are no func := (int32, int32) -> int32 and
>>>    func := (int32, int32) -> double in the Gandiva registry. Therefore we
>>>    can always know a function's type if its children's types are known.
>>> 2. There are no overloads that accept only non-default numeric types. For
>>>    example, if there were a func := (int16, int16) -> int16 and a
>>>    func := (int64, int64) -> int64, but no func := (int32, int32) -> int32,
>>>    then the inference procedure would fail on the expression func(1, 2)
>>>    because it cannot tell the types of 1 and 2, and giving them default
>>>    types won't work.
>>>
>>> Prototype and Examples
>>>
>>> I've already written a mostly working prototype of the parser and type
>>> inference visitor, along with unit tests for them. They are on this branch:
>>> https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference
>>>
>>> You can check out the diffs here:
>>> https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
>>>
>>> The main files are:
>>>
>>> 1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
>>> 2. cpp/src/gandiva/lex.ll: lex rules for Flex.
>>> 3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
>>> 4. cpp/src/gandiva/parser.cc: the driver class that combines the three
>>>    components.
>>> 5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the
>>>    proposed syntax and the resulting expression trees the parser generates.
>>>    You can run the tests by running cmake .. --preset=ninja-debug-gandiva
>>>    and ninja test-gandiva-tests.
>>>
>>> Any suggestion/question is appreciated!
>>>
>>> Best regards,
>>>
>>> Jin Shang