Re: Parser for expressions

Sasha Krassovsky Thu, 13 Oct 2022 12:04:17 -0700

Hi everyone,
I’d be fine with switching it to add(x, y). I’ll look into round-trip support;
I imagine we can massage the ToString implementation a bit as well to make it
easier to parse back.
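
As a rough illustration of what round-tripping call-style syntax could look like, here is a minimal sketch. It uses a toy tree rather than Arrow's Expression class, and the names Node, ToString and Parse are hypothetical, chosen only for this example. It handles bare identifiers and nested calls, nothing else.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Toy expression tree (hypothetical; not arrow::compute::Expression).
struct Node {
  std::string name;                         // function name or bare identifier
  std::vector<std::shared_ptr<Node>> args;  // arguments, used only for calls
  bool is_call = false;                     // distinguishes "f()" from "f"
};

// Prints in call style, e.g. add(x, multiply(y, z)).
std::string ToString(const Node& n) {
  if (!n.is_call) return n.name;
  std::string out = n.name + "(";
  for (std::size_t i = 0; i < n.args.size(); ++i) {
    if (i > 0) out += ", ";
    out += ToString(*n.args[i]);
  }
  return out + ")";
}

namespace detail {
inline void SkipSpaces(const std::string& s, std::size_t& pos) {
  while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
}

// Recursive descent: identifier, optionally followed by "(arg, arg, ...)".
inline std::shared_ptr<Node> ParseNode(const std::string& s, std::size_t& pos) {
  SkipSpaces(s, pos);
  const std::size_t start = pos;
  while (pos < s.size() &&
         (std::isalnum(static_cast<unsigned char>(s[pos])) || s[pos] == '_')) {
    ++pos;
  }
  if (pos == start) {
    throw std::invalid_argument("expected identifier at offset " + std::to_string(pos));
  }
  auto node = std::make_shared<Node>();
  node->name = s.substr(start, pos - start);
  SkipSpaces(s, pos);
  if (pos < s.size() && s[pos] == '(') {
    node->is_call = true;
    ++pos;  // consume '('
    SkipSpaces(s, pos);
    if (pos < s.size() && s[pos] == ')') {
      ++pos;  // empty argument list
      return node;
    }
    while (true) {
      node->args.push_back(ParseNode(s, pos));
      SkipSpaces(s, pos);
      if (pos < s.size() && s[pos] == ',') { ++pos; continue; }
      if (pos < s.size() && s[pos] == ')') { ++pos; break; }
      throw std::invalid_argument("expected ',' or ')' at offset " + std::to_string(pos));
    }
  }
  return node;
}
}  // namespace detail

// Parses a whole string and rejects trailing input.
std::shared_ptr<Node> Parse(const std::string& s) {
  std::size_t pos = 0;
  auto node = detail::ParseNode(s, pos);
  detail::SkipSpaces(s, pos);
  if (pos != s.size()) {
    throw std::invalid_argument("trailing input at offset " + std::to_string(pos));
  }
  return node;
}
```

With something like this, the round-trip property is simply that ToString(*Parse(ToString(e))) equals ToString(e) for every expression the printer can emit.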


Did anyone have opinions about the syntax for FieldRefs or Scalars? Scalars of
the form $type:value are particularly easy to parse and, I think, easy
enough to read and write.
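
To show why the $type:value form is easy to parse, here is a small sketch that splits such a literal into its type and value text. It is only an illustration: ScalarLiteral and SplitScalarLiteral are hypothetical names, and the rule of splitting at the first colon outside parentheses is an assumption made here so that types such as decimal(12,2) survive intact. Turning the two strings into an actual typed scalar is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// The two halves of a "$type:value" literal, still as text (hypothetical type).
struct ScalarLiteral {
  std::string type;
  std::string value;
};

// Splits "$type:value" at the first ':' that is not inside parentheses.
// Returns std::nullopt if the '$' prefix, the type name, or the ':' is missing.
std::optional<ScalarLiteral> SplitScalarLiteral(std::string_view text) {
  if (text.empty() || text.front() != '$') return std::nullopt;
  text.remove_prefix(1);  // drop '$'
  int depth = 0;          // parenthesis nesting inside the type name
  for (std::size_t i = 0; i < text.size(); ++i) {
    const char c = text[i];
    if (c == '(') {
      ++depth;
    } else if (c == ')') {
      --depth;
    } else if (c == ':' && depth == 0) {
      if (i == 0) return std::nullopt;  // empty type name
      return ScalarLiteral{std::string(text.substr(0, i)),
                           std::string(text.substr(i + 1))};
    }
  }
  return std::nullopt;  // no separating ':' found
}
```

For example, "$decimal(12,2):10.01" splits into the type text "decimal(12,2)" and the value text "10.01".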

Sasha

> On 12 Oct 2022, at 09:03, Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> 
> Another advantage of "add(x, y)" is that this matches our current string
> representation for expressions.
> 
> Although that might give the impression that we support anything that we
> output as string, and so that raises the question if we want to make this
> explicit: if we add parsing capabilities, would it be a goal to be able to
> roundtrip (simple) expressions in ToString -> parse again?
> 
> Joris
> 
>> On Tue, 11 Oct 2022 at 18:59, Weston Pace <weston.p...@gmail.com> wrote:
>> 
>> SQL is nearly universally understood so unless there is a compelling
>> reason I tend to use that as my default.
>> 
>> I don't see any particular advantage to favoring "(add x y)" over
>> "add(x, y)".
>> 
>> I will acknowledge that there are downsides to supporting x + y, I
>> think you listed these out already.
>> 
>> So, for expressions, I think it'd be fine if Acero initially supported
>> "add(x, y)" without supporting infix operators (and Gandiva supported
>> both) as long as there is a clear error message (e.g. "please use
>> add(x,y) instead of x+y").  This simplifies parsing and should avoid
>> confusion between the two.
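
A minimal sketch of the kind of error message suggested above, assuming the parser simply scans for the four arithmetic symbols before parsing. RejectInfix is a hypothetical helper name; add, subtract, multiply and divide are the corresponding Arrow compute function names. The check is deliberately naive: it would also flag a leading minus sign in a negative literal, which a real parser would have to handle.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Returns an error message if `text` contains an infix arithmetic operator,
// or std::nullopt if none was found. Naive by design: any '+', '-', '*' or '/'
// anywhere in the input is treated as an infix operator.
std::optional<std::string> RejectInfix(std::string_view text) {
  for (const char c : text) {
    const char* name = nullptr;
    switch (c) {
      case '+': name = "add"; break;
      case '-': name = "subtract"; break;
      case '*': name = "multiply"; break;
      case '/': name = "divide"; break;
      default: break;
    }
    if (name != nullptr) {
      return std::string("infix operator '") + c + "' is not supported; please use " +
             name + "(x, y) instead of x " + c + " y";
    }
  }
  return std::nullopt;
}
```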
>> 
>> If you want to then provide support for nodes / relations I think we
>> will need to deviate from SQL as it is simply not expressive enough.
>> 
>>> On Mon, Oct 10, 2022 at 12:17 PM Antoine Pitrou <anto...@python.org>
>>> wrote:
>>> 
>>> 
>>> I don't see the point of having two different syntaxes.
>>> 
>>> Also, IMHO lisp-style is harder for many people, so I would rather a
>>> more "traditional" syntax (though Lisp is historically traditional, of
>>> course ;-)).
>>> 
>>> 
>>> On 10/10/2022 at 21:10, Sasha Krassovsky wrote:
>>>> Yes that makes a lot of sense! I’d agree that it would probably be
>>>> fine to have two different syntaxes, seeing as the use-cases are a bit
>>>> different.
>>>> 
>>>> Did anyone else have any thoughts? Either on the lisp-style syntax for
>>>> Arrow’s Expressions or on having two different syntaxes? (Weston or
>>>> Antoine?)
>>>> 
>>>> Sasha
>>>> 
>>>>> On Oct 9, 2022, at 5:38 AM, Jin Shang <shangjin1...@gmail.com> wrote:
>>>>> 
>>>>> Hi Sasha,
>>>>> 
>>>>> I agree with your points. However, Gandiva is specialized in
>>>>> computing arithmetic expressions and offers few, if any,
>>>>> non-arithmetic operations, so it is very helpful if its parser
>>>>> understands natural math expressions.
>>>>> 
>>>>> Considering that Gandiva is a relatively independent component within
>>>>> the Arrow project, and that it’s only a math expression compiler rather
>>>>> than a fully featured compute engine, maybe it’s acceptable for Gandiva
>>>>> to have its own grammar, different from compute/Acero/Substrait etc.
>>>>> 
>>>>> Best,
>>>>> Jin
>>>>> 
>>>>>> On 8 Oct 2022, at 03:01, Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Jin,
>>>>>> I agree it would be good to standardize on a syntax. To me, the
>>>>>> advantages of the lisp-style syntax are:
>>>>>> - don’t have to define/implement any kind of precedence rules
>>>>>> - has a uniform syntax (no distinction between prefix and infix
>>>>>>   operators)
>>>>>> - avoids having “special” functions that have an associated
>>>>>>   arithmetic symbol
>>>>>> - translates directly to the underlying Expression infrastructure.
>>>>>> 
>>>>>> The advantage of the Python-style syntax is that it’s more natural
>>>>>> to use for arithmetic expressions. However, I think for non-arithmetic
>>>>>> expressions this syntax would be more cumbersome.
>>>>>> 
>>>>>> Either would work of course, I guess it just depends on the goal. I
>>>>>> was thinking the string representation wouldn’t represent any significant
>>>>>> level of abstraction, it is just a convenience to save on clutter when
>>>>>> typing out expressions.
>>>>>> 
>>>>>> Sasha
>>>>>> 
>>>>>>> On 6 Oct 2022, at 22:20, Jin Shang <shangjin1...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Sasha and Weston,
>>>>>>> 
>>>>>>> I'm the author of the mentioned Gandiva parser. I agree that having one
>>>>>>> unified syntax is ideal. I think one critical divergence between Sasha's
>>>>>>> and my proposals is that mine uses the C++/Python imperative style
>>>>>>> (foo(x, y, z), a+b…) and Sasha's uses the Lisp functional style
>>>>>>> ((foo x y z), (+ a b)…). I feel like it'll be better for us to settle
>>>>>>> on one of the styles before we start implementing the parsers.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jin
>>>>>>> 
>>>>>>>> On Friday, October 7, 2022, Sasha Krassovsky
>>>>>>>> <krassovskysa...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Weston,
>>>>>>>> I’d be happy to donate something like this to Substrait if that’s useful;
>>>>>>>> I was thinking of proving out a design here before going there. However, we
>>>>>>>> could also just go straight there :)
>>>>>>>> 
>>>>>>>> Regarding infix operators and such, the edge case I was thinking of is that
>>>>>>>> a user could potentially add a kernel to the registry called e.g. “+”.
>>>>>>>> Would the parser implicitly convert any instances of “+” to “add” and break
>>>>>>>> that?
>>>>>>>> 
>>>>>>>> Implicit typing for literals and parameters can probably also be added
>>>>>>>> without issues to the current scheme. Would the parameters be passed as an
>>>>>>>> std::unordered_map?
>>>>>>>> 
>>>>>>>>> Does a field_ref have to be a field name or can it be a field index?
>>>>>>>> 
>>>>>>>> It can be a field index or even a field path. The field ref is parsed
>>>>>>>> using FieldRef::FromDotPath ([1] in my original message), which can express
>>>>>>>> any FieldRef.
>>>>>>>> 
>>>>>>>> Sasha
>>>>>>>> 
>>>>>>>>>> On 6 Oct 2022, at 16:08, Weston Pace <weston.p...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Currently Substrait only has a binary (protobuf) serialization (and a
>>>>>>>>> protobuf JSON one but that's not really human writable and barely
>>>>>>>>> human readable).  Substrait does not have a text serialization.  I
>>>>>>>>> believe there is some desire for one (maybe Sasha wants to give it a
>>>>>>>>> try?).  A text format for Substrait would solve this problem because
>>>>>>>>> you could go "text expression" -> "substrait expression" -> "arrow
>>>>>>>>> expression".
>>>>>>>>> 
>>>>>>>>> Since no text format exists for Substrait I think that Substrait does
>>>>>>>>> not currently solve this problem or overlap with your work. However,
>>>>>>>>> at some point (hopefully), it will.
>>>>>>>>> 
>>>>>>>>> There was also a fairly recent proposal for a parser for gandiva
>>>>>>>>> expressions[1].
>>>>>>>>> 
>>>>>>>>> Compared with [1] I think this proposal is simpler to parse but lacks
>>>>>>>>> some of the shortcut conveniences (e.g. implicit types for literals,
>>>>>>>>> support for common infix operators (+, -, /, ...)).
>>>>>>>>> 
>>>>>>>>> Both are lacking parameters (e.g. "(equals(!x, %threshold%))"), which I
>>>>>>>>> think would be useful to have as one could then do something like `auto
>>>>>>>>> arrow_expr = Parse(my_expr, threshold)`.
>>>>>>>>> 
>>>>>>>>> Does a field_ref have to be a field name or can it be a field index?
>>>>>>>>> The latter is quite useful when the schema has duplicate field names.
>>>>>>>>> 
>>>>>>>>> I'm +0.5 on this change.  I worry a bit about having (eventually)
>>>>>>>>> three different syntaxes.  However, at the moment we have zero.
>>>>>>>>> 
>>>>>>>>> [1] https://lists.apache.org/thread/0oyns380hgzvl0y8kwgqoo4fp7ntt3bn
>>>>>>>>> 
>>>>>>>>>> On Wed, Oct 5, 2022 at 1:55 PM Sasha Krassovsky
>>>>>>>>>> <krassovskysa...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi David,
>>>>>>>>>> Could you elaborate on which part of my proposal overlaps with
>>>>>>>>>> Substrait? I don’t see anything in Substrait that allows me to do
>>>>>>>>>> something along the lines of
>>>>>>>>>> 
>>>>>>>>>> Expression e = Expression::FromString("(add !.a $int32:1)");
>>>>>>>>>> 
>>>>>>>>>> in the code.
>>>>>>>>>> 
>>>>>>>>>> Sasha
>>>>>>>>>> 
>>>>>>>>>>>> On Oct 5, 2022, at 1:35 PM, Lee, David
>>>>>>>>>>>> <david....@blackrock.com.INVALID> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I believe this is what substrait.io is trying to accomplish.
>>>>>>>>>>> 
>>>>>>>>>>> Here's some additional info:
>>>>>>>>>>> https://substrait.io/
>>>>>>>>>>> 
>>>>>>>>>>> https://www.youtube.com/watch?v=5JjaB7p3Sjk
>>>>>>>>>>> 
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Sasha Krassovsky <krassovskysa...@gmail.com>
>>>>>>>>>>> Sent: Wednesday, October 5, 2022 11:29 AM
>>>>>>>>>>> To: dev@arrow.apache.org
>>>>>>>>>>> Subject: Parser for expressions
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> I’ve noticed on the mailing list a few times people asking for a more
>>>>>>>>>>> convenient way to construct an Expression, namely using a string of some
>>>>>>>>>>> sort. I’ve found myself wishing for something like this too when
>>>>>>>>>>> constructing ExecPlans, and so I’ve gone ahead and implemented a parser
>>>>>>>>>>> [0]. I was wondering if anyone had any thoughts about the design of the
>>>>>>>>>>> language?
>>>>>>>>>>> 
>>>>>>>>>>> The current implementation parses a lisp-like language. This language
>>>>>>>>>>> has three types of expressions (mirroring the current Expression API):
>>>>>>>>>>> 
>>>>>>>>>>> - A call is a normal s-expression, it has the name of the kernel and
>>>>>>>>>>>   the list of arguments. Its arguments can be any expression.
>>>>>>>>>>> - A literal (i.e. scalar) starts with a $ and specifies a type and a
>>>>>>>>>>>   value, separated by a colon. For example, `$decimal(12,2):10.01` specifies
>>>>>>>>>>>   a literal of type decimal(12, 2) and a value of 10.01.
>>>>>>>>>>> - A field_ref starts with a ! and is an identifier in the schema
>>>>>>>>>>>   following the DotPath syntax we already have [1].
>>>>>>>>>>> 
>>>>>>>>>>> So for example, the expression
>>>>>>>>>>> 
>>>>>>>>>>> (add $int32:1 (multiply !.a !.b))
>>>>>>>>>>> 
>>>>>>>>>>> computes a*b+1 given a batch with columns named a and b.
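
The three expression kinds described above can be sketched as a small recursive parser. This is an illustrative toy, not the parser from the PR: SExpr, Kind and ParseSExpr are hypothetical names, literals and field refs are kept as raw text instead of being converted to Arrow scalars or FieldRefs, and parentheses inside an atom (as in $decimal(12,2):10.01) are assumed to be balanced.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum class Kind { kCall, kLiteral, kFieldRef };

// Toy syntax tree (hypothetical; not arrow::compute::Expression).
struct SExpr {
  Kind kind;
  std::string text;         // kernel name, "type:value", or dot path
  std::vector<SExpr> args;  // used only for calls
};

namespace sexpr_detail {
inline void SkipSpaces(const std::string& s, std::size_t& pos) {
  while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
}

// Reads one atom up to whitespace or an unmatched ')'. Balanced parentheses
// inside the atom are kept, so "decimal(12,2):10.01" stays in one piece.
inline std::string ReadAtom(const std::string& s, std::size_t& pos) {
  const std::size_t start = pos;
  int depth = 0;
  while (pos < s.size()) {
    const char c = s[pos];
    if (c == '(') {
      ++depth;
    } else if (c == ')') {
      if (depth == 0) break;
      --depth;
    } else if (depth == 0 && std::isspace(static_cast<unsigned char>(c))) {
      break;
    }
    ++pos;
  }
  return s.substr(start, pos - start);
}

inline SExpr ParseOne(const std::string& s, std::size_t& pos) {
  SkipSpaces(s, pos);
  if (pos >= s.size()) throw std::invalid_argument("unexpected end of input");
  if (s[pos] == '(') {  // call: (kernel arg arg ...)
    ++pos;
    SkipSpaces(s, pos);
    SExpr call{Kind::kCall, ReadAtom(s, pos), {}};
    if (call.text.empty()) throw std::invalid_argument("expected kernel name");
    while (true) {
      SkipSpaces(s, pos);
      if (pos >= s.size()) throw std::invalid_argument("missing ')'");
      if (s[pos] == ')') { ++pos; return call; }
      call.args.push_back(ParseOne(s, pos));
    }
  }
  if (s[pos] == '$') { ++pos; return SExpr{Kind::kLiteral, ReadAtom(s, pos), {}}; }
  if (s[pos] == '!') { ++pos; return SExpr{Kind::kFieldRef, ReadAtom(s, pos), {}}; }
  throw std::invalid_argument("expected '(', '$' or '!' at offset " + std::to_string(pos));
}
}  // namespace sexpr_detail

// Parses a whole string and rejects trailing input.
SExpr ParseSExpr(const std::string& s) {
  std::size_t pos = 0;
  SExpr e = sexpr_detail::ParseOne(s, pos);
  sexpr_detail::SkipSpaces(s, pos);
  if (pos != s.size()) throw std::invalid_argument("trailing input");
  return e;
}
```

Because every construct is announced by its first character ('(' for a call, '$' for a literal, '!' for a field_ref), the parser needs no lookahead and no precedence rules.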
>>>>>>>>>>> 
>>>>>>>>>>> The reason I chose a lisp-like language is that it very directly
>>>>>>>>>>> translates to the current Expression API and that it feels more natural to
>>>>>>>>>>> use a prefix notation for a language where all functions have a name (i.e.
>>>>>>>>>>> no +, -, *, etc.).
>>>>>>>>>>> 
>>>>>>>>>>> I’m currently working on a followup PR for specifying ExecPlans from a
>>>>>>>>>>> string (mainly for easier testing), and would like that language to be an
>>>>>>>>>>> extension of this one. Looking forward to hearing everyone’s thoughts!
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sasha Krassovsky
>>>>>>>>>>> 
>>>>>>>>>>> [0] https://github.com/apache/arrow/pull/14287
>>>>>>>>>>> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1726
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>> 
>>
