Re: Parser for expressions

Joris Van den Bossche Wed, 12 Oct 2022 09:03:17 -0700

Another advantage of "add(x, y)" is that this matches our current string
representation for expressions.


Although that might give the impression that we can parse anything that we
output as a string, which raises the question of whether we want to make this
explicit: if we add parsing capabilities, would it be a goal to be able to
round-trip (simple) expressions, i.e. ToString -> parse back to the same expression?
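
To make the round-trip idea concrete, here is a minimal, self-contained C++
sketch. It does not use Arrow: Expr, Parse and ToString below are hypothetical
stand-ins, not the actual Expression API, and the grammar (identifiers and
nested calls only, no literals or field refs) is an assumption made purely for
illustration. It also shows the kind of error message suggested further down
the thread for infix operators.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression tree: either a bare identifier or a call.
struct Expr {
  std::string name;        // identifier, or function name for a call
  bool is_call = false;    // true when the node is name(args...)
  std::vector<Expr> args;  // call arguments (empty for identifiers)
};

// Canonical string form: name(arg, arg, ...) with ", " between arguments.
std::string ToString(const Expr& e) {
  if (!e.is_call) return e.name;
  std::string out = e.name + "(";
  for (std::size_t i = 0; i < e.args.size(); ++i) {
    if (i > 0) out += ", ";
    out += ToString(e.args[i]);
  }
  return out + ")";
}

// Recursive-descent parser for the same call syntax.
class Parser {
 public:
  explicit Parser(std::string text) : text_(std::move(text)) {}

  Expr ParseAll() {
    Expr e = ParseExpr();
    SkipSpaces();
    if (pos_ < text_.size()) Fail();  // trailing input, e.g. "x + y"
    return e;
  }

 private:
  Expr ParseExpr() {
    SkipSpaces();
    const std::size_t start = pos_;
    while (pos_ < text_.size() && IsIdentChar(text_[pos_])) ++pos_;
    if (pos_ == start) Fail();  // no identifier found
    Expr e;
    e.name = text_.substr(start, pos_ - start);
    SkipSpaces();
    if (pos_ < text_.size() && text_[pos_] == '(') {
      e.is_call = true;
      ++pos_;
      SkipSpaces();
      if (pos_ < text_.size() && text_[pos_] == ')') { ++pos_; return e; }
      while (true) {
        e.args.push_back(ParseExpr());
        SkipSpaces();
        if (pos_ < text_.size() && text_[pos_] == ',') { ++pos_; continue; }
        if (pos_ < text_.size() && text_[pos_] == ')') { ++pos_; break; }
        Fail();  // neither ',' nor ')' after an argument
      }
    }
    return e;
  }

  static bool IsIdentChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
  }

  void SkipSpaces() {
    while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
  }

  // Reports infix operators with a hint instead of a generic error.
  [[noreturn]] void Fail() const {
    if (pos_ < text_.size() &&
        std::string("+-*/").find(text_[pos_]) != std::string::npos) {
      throw std::invalid_argument(
          "infix operators are not supported; "
          "please use add(x, y) instead of x + y");
    }
    throw std::invalid_argument("unexpected input at position " +
                                std::to_string(pos_));
  }

  std::string text_;
  std::size_t pos_ = 0;
};

Expr Parse(const std::string& text) { return Parser(text).ParseAll(); }
```

With this sketch, ToString(Parse(s)) normalizes whitespace, and parsing that
output again yields the same string, which is the property a round-trip goal
would ask for.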

Joris

On Tue, 11 Oct 2022 at 18:59, Weston Pace <weston.p...@gmail.com> wrote:

> SQL is nearly universally understood so unless there is a compelling
> reason I tend to use that as my default.
>
> I don't see any particular advantage to favoring "(add x y)" over "add(x,
> y)"
>
> I will acknowledge that there are downsides to supporting x + y, I
> think you listed these out already.
>
> So, for expressions, I think it'd be fine if Acero initially supported
> "add(x, y)" without supporting infix operators (and gandiva supported
> both) as long as there is a clear error message (e.g. "please use
> add(x,y) instead of x+y").  This simplifies parsing and should avoid
> confusion between the two.
>
> If you want to then provide support for nodes / relations I think we
> will need to deviate from SQL as it is simply not expressive enough.
>
> On Mon, Oct 10, 2022 at 12:17 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >
> > I don't see the point of having two different syntaxes.
> >
> > Also, IMHO lisp-style is harder for many people, so I would rather have a
> > more "traditional" syntax (though Lisp is historically traditional, of
> > course ;-)).
> >
> >
> > On 10/10/2022 at 21:10, Sasha Krassovsky wrote:
> > > Yes that makes a lot of sense! I’d agree that it would probably be
> fine to have two different syntaxes, seeing as the use-cases are a bit
> different.
> > >
> > > Did anyone else have any thoughts? Either on the lisp-style syntax for
> Arrow’s Expressions or on having two different syntaxes? (Weston or
> Antoine?)
> > >
> > > Sasha
> > >
> > >> On Oct 9, 2022, at 5:38 AM, Jin Shang <shangjin1...@gmail.com> wrote:
> > >>
> > >> Hi Sasha,
> > >>
> > >> I agree with your points. However Gandiva is kind of specialized in
> computing arithmetic expressions and it offers few, if any,
> non-arithmetic operations. So it is very helpful if its parser understands
> natural math expressions.
> > >>
> > >> Considering that Gandiva is a relatively independent component within
> the arrow project, and that it’s only a math expression compiler rather
> than a full-featured compute engine, maybe it’s acceptable for Gandiva
> to have its own grammar different from compute/Acero/Substrait etc.
> > >>
> > >> Best,
> > >> Jin
> > >>
> > >>> On Oct 8, 2022, at 03:01, Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > >>>
> > >>> Hi Jin,
> > >>> I agree it would be good to standardize on a syntax. To me, the
> advantages of the lisp-style syntax are:
> > >>> - don’t have to define/implement any kind of precedence rules
> > >>> - has a uniform syntax (no distinction between prefix and infix
> operators)
> > >>> - avoids having “special” functions that have an associated
> arithmetic symbol
> > >>> - translates directly to the underlying Expression infrastructure.
> > >>>
> > >>> The advantage of the Python-style syntax is that it’s more natural
> to use for arithmetic expressions. However, I think for non-arithmetic
> expressions this syntax would be more cumbersome.
> > >>>
> > >>> Either would work of course, I guess it just depends on the goal. I
> was thinking the string representation wouldn’t represent any significant
> level of abstraction, it is just a convenience to save on clutter when
> typing out expressions.
> > >>>
> > >>> Sasha
> > >>>
> > >>>> On Oct 6, 2022, at 22:20, Jin Shang <shangjin1...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Sasha and Weston,
> > >>>>
> > >>>> I'm the author of the mentioned Gandiva parser. I agree that having
> one
> > >>>> unified syntax is ideal. I think one critical divergence between
> Sasha's
> > >>>> and my proposals is that mine is with C++/Python imperative style
> (foo(x,
> > >>>> y, z), a+b…) and Sasha's is with Lisp functional style ((foo x y
> z), (+ a
> > >>>> b)…). I feel like it'll be better for us to settle on one of the
> styles
> > >>>> before we start implementing the parsers.
> > >>>>
> > >>>> Best,
> > >>>> Jin
> > >>>>
> > >>>>> On Friday, October 7, 2022, Sasha Krassovsky <
> krassovskysa...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>> Hi Weston,
> > >>>>> I’d be happy to donate something like this to Substrait if that’s
> useful,
> > >>>>> I was thinking of proving out a design here before going there.
> However we
> > >>>>> could also just go straight there :)
> > >>>>>
> > >>>>> Regarding infix operators and such the edge case I was thinking of
> is that
> > >>>>> a user could potentially add a kernel to the registry called e.g.
> “+”.
> > >>>>> Would the parser implicitly convert any instances of “+” to “add”
> and break
> > >>>>> that?
> > >>>>>
> > >>>>> Implicit typing for literals and parameters can probably also be
> added
> > >>>>> without issues to the current scheme. Would the parameters be
> passed as an
> > >>>>> std::unordered_map?
> > >>>>>
> > >>>>>> Does a field_ref have to be a field name or can it be a field
> index?
> > >>>>>
> > >>>>> It can be a field index or even a field path. The field ref is
> parsed
> > >>>>> using FieldRef::FromDotPath ([1] in my original message), which
> can express
> > >>>>> any FieldRef.
> > >>>>>
> > >>>>> Sasha
> > >>>>>
> > >>>>>>> On Oct 6, 2022, at 16:08, Weston Pace <weston.p...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Currently Substrait only has a binary (protobuf) serialization
> (and a
> > >>>>>> protobuf JSON one but that's not really human writable and barely
> > >>>>>> human readable).  Substrait does not have a text serialization.  I
> > >>>>>> believe there is some desire for one (maybe Sasha wants to give
> it a
> > >>>>>> try?).  A text format for Substrait would solve this problem
> because
> > >>>>>> you could go "text expression" -> "substrait expression" -> "arrow
> > >>>>>> expression".
> > >>>>>>
> > >>>>>> Since no text format exists for Substrait I think that Substrait
> does
> > >>>>>> not currently solve this problem or overlap with your work.
> However,
> > >>>>>> at some point (hopefully), it will.
> > >>>>>>
> > >>>>>> There was also a fairly recent proposal for a parser for gandiva
> > >>>>> expressions[1].
> > >>>>>>
> > >>>>>> Compared with [1] I think this proposal is simpler to parse but
> lacks
> > >>>>>> some of the shortcut conveniences (e.g. implicit types for
> literals,
> > >>>>>> support for common infix operators (+, -, /, ...)).
> > >>>>>>
> > >>>>>> Both are lacking parameters (e.g. "equals(!x, %threshold%)"), which
> > >>>>>> I think would be useful to have, as one could then do something like
> > >>>>>> `auto arrow_expr = Parse(my_expr, threshold)`.
> > >>>>>>
> > >>>>>> Does a field_ref have to be a field name or can it be a field
> index?
> > >>>>>> The latter is quite useful when the schema has duplicate field
> names.
> > >>>>>>
> > >>>>>> I'm +0.5 on this change.  I worry a bit about having (eventually)
> > >>>>>> three different syntaxes.  However, at the moment we have zero.
> > >>>>>>
> > >>>>>> [1]
> https://lists.apache.org/thread/0oyns380hgzvl0y8kwgqoo4fp7ntt3bn
> > >>>>>>
> > >>>>>>> On Wed, Oct 5, 2022 at 1:55 PM Sasha Krassovsky
> > >>>>>>> <krassovskysa...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi David,
> > >>>>>>> Could you elaborate on which part of my proposal overlaps with
> > >>>>> Substrait? I don’t see anything in Substrait that allows me to do
> something
> > >>>>> along the lines of
> > >>>>>>>
> > >>>>>>> Expression e = Expression::FromString("(add !.a $int32:1)");
> > >>>>>>>
> > >>>>>>> in the code.
> > >>>>>>>
> > >>>>>>> Sasha
> > >>>>>>>
> > >>>>>>>>> On Oct 5, 2022, at 1:35 PM, Lee, David <david....@blackrock.com.INVALID> wrote:
> > >>>>>>>>
> > >>>>>>>> I believe this is what substrait.io is trying to accomplish.
> > >>>>>>>>
> > >>>>>>>> Here's some additional info:
> > >>>>>>>> https://substrait.io/
> > >>>>>>>>
> > >>>>>>>> https://www.youtube.com/watch?v=5JjaB7p3Sjk
> > >>>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Sasha Krassovsky <krassovskysa...@gmail.com>
> > >>>>>>>> Sent: Wednesday, October 5, 2022 11:29 AM
> > >>>>>>>> To: dev@arrow.apache.org
> > >>>>>>>> Subject: Parser for expressions
> > >>>>>>>>
> > >>>>>>>> Hi everyone,
> > >>>>>>>> I’ve noticed on the mailing list a few times people asking for
> a more
> > >>>>> convenient way to construct an Expression, namely using a string
> of some
> > >>>>> sort. I’ve found myself wishing for something like this too when
> > >>>>> constructing ExecPlans, and so I’ve gone ahead and implemented a
> parser
> > >>>>> [0]. I was wondering if anyone had any thoughts about the design
> of the
> > >>>>> language?
> > >>>>>>>>
> > >>>>>>>> The current implementation parses a lisp-like language. This
> language
> > >>>>> has three types of expressions (mirroring the current Expression
> API):
> > >>>>>>>>
> > >>>>>>>> - A call is a normal s-expression, it has the name of the
> kernel and
> > >>>>> the list of arguments. Its arguments can be any expression.
> > >>>>>>>> - A literal (i.e. scalar) starts with a $ and specifies a type
> and a
> > >>>>> value, separated by a colon. For example, `$decimal(12,2):10.01`
> specifies
> > >>>>> a literal of type decimal(12, 2) and a value of 10.01.
> > >>>>>>>> - A field_ref starts with a ! and is an identifier in the schema
> > >>>>> following the DotPath syntax we already have [1].
> > >>>>>>>>
> > >>>>>>>> So for example, the expression
> > >>>>>>>>
> > >>>>>>>> (add $int32:1 (multiply !.a !.b))
> > >>>>>>>>
> > >>>>>>>> computes a*b+1 given a batch with columns named a and b.
> > >>>>>>>>
> > >>>>>>>> The reason I chose a lisp-like language is that it very directly
> > >>>>> translates to the current Expression API and that it feels more
> natural to
> > >>>>> use a prefix notation for a language where all functions have a
> name (i.e.
> > >>>>> no +, -, *, etc.).
> > >>>>>>>>
> > >>>>>>>> I’m currently working on a followup PR for specifying ExecPlans
> from a
> > >>>>> string (mainly for easier testing), and would like that language
> to be an
> > >>>>> extension of this one. Looking forward to hearing everyone’s
> thoughts!
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Sasha Krassovsky
> > >>>>>>>>
> > >>>>>>>> [0] https://github.com/apache/arrow/pull/14287
> > >>>>>>>> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1726
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>
> > >
>
