MongoDB has its own concept of aggregation pipelines as well. https://www.mongodb.com/docs/manual/core/aggregation-pipeline/#std-label-aggregation-pipeline
On Wed, Oct 18, 2023, 6:07 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:

> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis <byronel...@google.com> wrote:
> >
> > Is it worth taking a look at similar prior art in the space?
>
> +1. Pointers welcome.
>
> > The first one that comes to mind is Transform, but with the dbt labs
> > acquisition that spec is a lot harder to find. Rill is pretty similar
> > though.
>
> Rill seems to be very SQL-based.
>
> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <
> > dev@beam.apache.org> wrote:
> >>
> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> feature for even writing a WordCount is the ability to do Aggregations
> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> CombineValues), we're eschewing KVs in favor of more schema'd
> >> data (which has some precedent in our other languages, see the links
> >> below). The key components the user needs to specify are (1) the key
> >> fields on which the grouping will take place, (2) the fields
> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> fn to use.
> >>
> >> A straw-man example could be something like
> >>
> >> type: Aggregating
> >> config:
> >>   key: [field1, field2]
> >>   aggregating:
> >>     total_cost:
> >>       fn: sum
> >>       value: cost
> >>     max_cost:
> >>       fn: max
> >>       value: cost
> >>
> >> This would basically correspond to the SQL expression
> >>
> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> from table GROUP BY field1, field2"
> >>
> >> (though I'm not requiring that we use this as an implementation
> >> strategy). I do not think we need a separate (non-aggregating)
> >> Grouping operation; this can be accomplished by having a concat-style
> >> combiner.
> >>
> >> There are still some open questions here, notably around how to
> >> specify the aggregation fns themselves. We could of course provide a
> >> number of built-ins (like SQL does).
> >> This gets into the question of
> >> how and where to document this complete set, but some basics should
> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> quantiles); where do we put the parameters? We could go with something
> >> like
> >>
> >> fn:
> >>   type: ApproximateQuantiles
> >>   config:
> >>     n: 10
> >>
> >> but others are even configured by functions themselves (e.g. LargestN
> >> that wants a comparator Fn). Maybe we decide not to support these
> >> (yet?)
> >>
> >> One thing I think we should support, however, is referencing custom
> >> CombineFns. We have some precedent for this with our Fns from
> >> MapToFields, where we accept things like inline lambdas and external
> >> references. Again the topic of how to configure them comes up, as
> >> these custom Fns are more likely to be parameterized than Map Fns
> >> (though, to be clear, perhaps it'd be good to allow parameterization of
> >> MapFns as well). Maybe we allow
> >>
> >> language: python  # like MapToFields (and here it'd be harder to mix
> >>                   # and match per Fn)
> >> fn:
> >>   type: ???
> >>   # should these be nested as config?
> >>   name: fully.qualified.name
> >>   path: /path/to/defining/file
> >>   args: [...]
> >>   kwargs: {...}
> >>
> >> which would invoke the constructor.
> >>
> >> I'm also open to other ways of naming/structuring these essential
> >> parameters if it makes things more clear.
> >>
> >> - Robert
> >>
> >>
> >> Java: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> Python: https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> Typescript: https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >>
> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> towards offering something more native.
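For concreteness, the semantics of the straw-man Aggregating config can be sketched in plain Python (this is not Beam code; the `aggregate` helper and the sample rows are made up for illustration, with the fn names already resolved to callables):

```python
from collections import defaultdict

def aggregate(rows, key, aggregating):
    """Group `rows` (dicts) by the `key` fields, then compute each
    named aggregation over the grouped values -- mirroring
    "SELECT key..., fn(value) AS name ... GROUP BY key..."."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    out = []
    for key_values, group in groups.items():
        result = dict(zip(key, key_values))
        for out_field, spec in aggregating.items():
            values = [row[spec["value"]] for row in group]
            result[out_field] = spec["fn"](values)
        out.append(result)
    return out

rows = [
    {"field1": "a", "field2": 1, "cost": 2.0},
    {"field1": "a", "field2": 1, "cost": 3.0},
    {"field1": "b", "field2": 2, "cost": 5.0},
]
# The analogue of the YAML config, with sum/max as the built-in fns.
result = aggregate(
    rows,
    key=["field1", "field2"],
    aggregating={
        "total_cost": {"fn": sum, "value": "cost"},
        "max_cost": {"fn": max, "value": "cost"},
    },
)
# e.g. the ("a", 1) group yields total_cost=5.0, max_cost=3.0
```

A concat-style combiner (fn=list, say) would cover the non-aggregating Grouping case mentioned above.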
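The "fully qualified name plus args/kwargs" idea for custom CombineFns could resolve to a constructor call roughly like the following sketch (standard library only; `make_fn` is a hypothetical helper, not an existing Beam API, and the `path` field is ignored here):

```python
import importlib

def make_fn(spec):
    """Resolve a spec like
        {"name": "pkg.mod.ClassOrFn", "args": [...], "kwargs": {...}}
    by importing the module, looking up the attribute, and calling it
    with the given args/kwargs (i.e. invoking the constructor)."""
    module_name, _, attr = spec["name"].rpartition(".")
    obj = getattr(importlib.import_module(module_name), attr)
    return obj(*spec.get("args", []), **spec.get("kwargs", {}))

# Example: construct a stdlib object from a config-style spec.
counter = make_fn({"name": "collections.Counter", "kwargs": {"a": 2, "b": 1}})
```

Error handling (missing module, non-callable attribute) and the cross-language story are the open parts; the lookup itself is straightforward.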