Re: [YAML] Aggregations

XQ Hu via dev Mon, 23 Oct 2023 07:00:19 -0700

+1 on your proposal.

On Fri, Oct 20, 2023 at 4:59 PM Robert Bradshaw via dev <dev@beam.apache.org>
wrote:


> On Fri, Oct 20, 2023 at 11:35 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > A couple other bits on having an expression language:
> >
> >  - You already have Python lambdas at places, right? so that's quite a
> lot more complex than SQL project/aggregate expressions
> >  - It really does save a lot of pain for users (at the cost of
> implementation complexity) when you need to "SUM(col1*col2)" where
> otherwise you have to Map first. This could be viewed as desirable as well,
> of course.
> >
> > Anyhow I'm pretty much in agreement with all your reasoning as to why
> *not* to use SQL-like expressions in strings. But it does seem odd when
> juxtaposed with Python snippets.
>
> Well, we say "here's a Python expression" when we're using a Python
> string. But "SUM(col1*col2)" isn't as transparent. (Agree about the
> niceties of being able to provide an expression rather than a column.)
>
> > On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax <re...@google.com> wrote:
> >> >
> >> > Is the schema Group transform (in Java) something along these lines?
> >>
> >> Yes, for sure it is. It (and Python's and Typescript's equivalent) are
> >> linked in the original post. The open question is how to best express
> >> this in YAML.
> >>
> >> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >> >>
> >> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> >> feature for even writing a WordCount is the ability to do
> Aggregations
> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> >> CombineValues), we're eschewing KVs in the notion of more schema'd
> >> >> data (which has some precedence in our other languages, see the links
> >> >> below). The key components the user needs to specify are (1) the key
> >> >> fields on which the grouping will take place, (2) the fields
> >> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> >> fn to use.
> >> >>
> >> >> A straw-man example could be something like
> >> >>
> >> >> type: Aggregating
> >> >> config:
> >> >>   key: [field1, field2]
> >> >>   aggregating:
> >> >>     total_cost:
> >> >>       fn: sum
> >> >>       value: cost
> >> >>     max_cost:
> >> >>       fn: max
> >> >>       value: cost
> >> >>
> >> >> This would basically correspond to the SQL expression
> >> >>
> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as
> max_cost
> >> >> from table GROUP BY field1, field2"
> >> >>
> >> >> (though I'm not requiring that we use this as an implementation
> >> >> strategy). I do not think we need a separate (non aggregating)
> >> >> Grouping operation, this can be accomplished by having a concat-style
> >> >> combiner.
> >> >>
> >> >> There are still some open questions here, notably around how to
> >> >> specify the aggregation fns themselves. We could of course provide a
> >> >> number of built-ins (like SQL does). This gets into the question of
> >> >> how and where to document this complete set, but some basics should
> >> >> take us pretty far. Many aggregators, however, are parameterized
> (e.g.
> >> >> quantiles); where do we put the parameters? We could go with
> something
> >> >> like
> >> >>
> >> >> fn:
> >> >>   type: ApproximateQuantiles
> >> >>   config:
> >> >>     n: 10
> >> >>
> >> >> but others are even configured by functions themselves (e.g. LargestN
> >> >> that wants a comparator Fn). Maybe we decide not to support these
> >> >> (yet?)
> >> >>
> >> >> One thing I think we should support, however, is referencing custom
> >> >> CombineFns. We have some precedent for this with our Fns from
> >> >> MapToFields, where we accept things like inline lambdas and external
> >> >> references. Again the topic of how to configure them comes up, as
> >> >> these custom Fns are more likely to be parameterized than Map Fns
> >> >> (though, to be clear, perhaps it'd be good to allow parameterizatin
> of
> >> >> MapFns as well). Maybe we allow
> >> >>
> >> >> language: python. # like MapToFields (and here it'd be harder to mix
> >> >> and match per Fn)
> >> >> fn:
> >> >>   type: ???
> >> >>   # should these be nested as config?
> >> >>   name: fully.qualiied.name
> >> >>   path: /path/to/defining/file
> >> >>   args: [...]
> >> >>   kwargs: {...}
> >> >>
> >> >> which would invoke the constructor.
> >> >>
> >> >> I'm also open to other ways of naming/structuring these essential
> >> >> parameters if it makes things more clear.
> >> >>
> >> >> - Robert
> >> >>
> >> >>
> >> >> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> >> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> >> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >> >>
> >> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> >> towards offering something more native.
>

Re: [YAML] Aggregations

Reply via email to