MongoDB has its own concept of aggregation pipelines as well. https://www.mongodb.com/docs/manual/core/aggregation-pipeline/#std-label-aggregation-pipeline
On Wed, Oct 18, 2023, 6:07 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:

> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis <byronel...@google.com> wrote:
> >
> > Is it worth taking a look at similar prior art in the space?
>
> +1. Pointers welcome.
>
> > The first one that comes to mind is Transform, but with the dbt labs
> > acquisition that spec is a lot harder to find. Rill is pretty similar
> > though.
>
> Rill seems to be very SQL-based.
>
> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <
> > dev@beam.apache.org> wrote:
> >>
> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> feature for even writing a WordCount is the ability to do Aggregations
> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> CombineValues), we're eschewing KVs in favor of more schema'd
> >> data (which has some precedent in our other languages, see the links
> >> below). The key components the user needs to specify are (1) the key
> >> fields on which the grouping will take place, (2) the fields
> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> fn to use.
> >>
> >> A straw-man example could be something like
> >>
> >> type: Aggregating
> >> config:
> >>   key: [field1, field2]
> >>   aggregating:
> >>     total_cost:
> >>       fn: sum
> >>       value: cost
> >>     max_cost:
> >>       fn: max
> >>       value: cost
> >>
> >> This would basically correspond to the SQL expression
> >>
> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> from table GROUP BY field1, field2"
> >>
> >> (though I'm not requiring that we use this as an implementation
> >> strategy). I do not think we need a separate (non-aggregating)
> >> Grouping operation; this can be accomplished by having a concat-style
> >> combiner.
> >>
> >> There are still some open questions here, notably around how to
> >> specify the aggregation fns themselves. We could of course provide a
> >> number of built-ins (like SQL does).
> >> This gets into the question of
> >> how and where to document this complete set, but some basics should
> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> quantiles); where do we put the parameters? We could go with something
> >> like
> >>
> >> fn:
> >>   type: ApproximateQuantiles
> >>   config:
> >>     n: 10
> >>
> >> but others are even configured by functions themselves (e.g. LargestN
> >> that wants a comparator Fn). Maybe we decide not to support these
> >> (yet?)
> >>
> >> One thing I think we should support, however, is referencing custom
> >> CombineFns. We have some precedent for this with our Fns from
> >> MapToFields, where we accept things like inline lambdas and external
> >> references. Again the topic of how to configure them comes up, as
> >> these custom Fns are more likely to be parameterized than Map Fns
> >> (though, to be clear, perhaps it'd be good to allow parameterization of
> >> MapFns as well). Maybe we allow
> >>
> >> language: python  # like MapToFields (and here it'd be harder to mix
> >>                   # and match per Fn)
> >> fn:
> >>   type: ???
> >>   # should these be nested as config?
> >>   name: fully.qualified.name
> >>   path: /path/to/defining/file
> >>   args: [...]
> >>   kwargs: {...}
> >>
> >> which would invoke the constructor.
> >>
> >> I'm also open to other ways of naming/structuring these essential
> >> parameters if it makes things more clear.
> >>
> >> - Robert
> >>
> >>
> >> Java: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> Python: https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> Typescript: https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >>
> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> towards offering something more native.
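For concreteness, the semantics of the straw-man Aggregating config can be sketched in plain Python (this is not Beam code; the `aggregate` helper and the sample rows are made up for illustration, with the fn names already resolved to callables):

```python
from collections import defaultdict

def aggregate(rows, key, aggregating):
    """Group `rows` (dicts) by the `key` fields, then compute each
    named aggregation over the grouped values -- mirroring
    "SELECT key..., fn(value) AS name ... GROUP BY key..."."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    out = []
    for key_values, group in groups.items():
        result = dict(zip(key, key_values))
        for out_field, spec in aggregating.items():
            values = [row[spec["value"]] for row in group]
            result[out_field] = spec["fn"](values)
        out.append(result)
    return out

rows = [
    {"field1": "a", "field2": 1, "cost": 2.0},
    {"field1": "a", "field2": 1, "cost": 3.0},
    {"field1": "b", "field2": 2, "cost": 5.0},
]
# The analogue of the YAML config, with sum/max as the built-in fns.
result = aggregate(
    rows,
    key=["field1", "field2"],
    aggregating={
        "total_cost": {"fn": sum, "value": "cost"},
        "max_cost": {"fn": max, "value": "cost"},
    },
)
# e.g. the ("a", 1) group yields total_cost=5.0, max_cost=3.0
```

A concat-style combiner (fn=list, say) would cover the non-aggregating Grouping case mentioned above.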
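The "fully qualified name plus args/kwargs" idea for custom CombineFns could resolve to a constructor call roughly like the following sketch (standard library only; `make_fn` is a hypothetical helper, not an existing Beam API, and the `path` field is ignored here):

```python
import importlib

def make_fn(spec):
    """Resolve a spec like
        {"name": "pkg.mod.ClassOrFn", "args": [...], "kwargs": {...}}
    by importing the module, looking up the attribute, and calling it
    with the given args/kwargs (i.e. invoking the constructor)."""
    module_name, _, attr = spec["name"].rpartition(".")
    obj = getattr(importlib.import_module(module_name), attr)
    return obj(*spec.get("args", []), **spec.get("kwargs", {}))

# Example: construct a stdlib object from a config-style spec.
counter = make_fn({"name": "collections.Counter", "kwargs": {"a": 2, "b": 1}})
```

Error handling (missing module, non-callable attribute) and the cross-language story are the open parts; the lookup itself is straightforward.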