Is it worth taking a look at similar prior art in the space? The first one that comes to mind is Transform, but with the dbt labs acquisition that spec is a lot harder to find. Rill <https://docs.rilldata.com/example-projects> is pretty similar though.
On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote: > Beam Yaml has good support for IOs and mappings, but one key missing > feature for even writing a WordCount is the ability to do Aggregations > [1]. While the traditional Beam primitive is GroupByKey (and > CombineValues), we're eschewing KVs in the notion of more schema'd > data (which has some precedence in our other languages, see the links > below). The key components the user needs to specify are (1) the key > fields on which the grouping will take place, (2) the fields > (expressions?) involved in the aggregation, and (3) what aggregating > fn to use. > > A straw-man example could be something like > > type: Aggregating > config: > key: [field1, field2] > aggregating: > total_cost: > fn: sum > value: cost > max_cost: > fn: max > value: cost > > This would basically correspond to the SQL expression > > "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost > from table GROUP BY field1, field2" > > (though I'm not requiring that we use this as an implementation > strategy). I do not think we need a separate (non aggregating) > Grouping operation, this can be accomplished by having a concat-style > combiner. > > There are still some open questions here, notably around how to > specify the aggregation fns themselves. We could of course provide a > number of built-ins (like SQL does). This gets into the question of > how and where to document this complete set, but some basics should > take us pretty far. Many aggregators, however, are parameterized (e.g. > quantiles); where do we put the parameters? We could go with something > like > > fn: > type: ApproximateQuantiles > config: > n: 10 > > but others are even configured by functions themselves (e.g. LargestN > that wants a comparator Fn). Maybe we decide not to support these > (yet?) > > One thing I think we should support, however, is referencing custom > CombineFns. We have some precedent for this with our Fns from > MapToFields, where we accept things like inline lambdas and external > references. Again the topic of how to configure them comes up, as > these custom Fns are more likely to be parameterized than Map Fns > (though, to be clear, perhaps it'd be good to allow parameterizatin of > MapFns as well). Maybe we allow > > language: python. # like MapToFields (and here it'd be harder to mix > and match per Fn) > fn: > type: ??? > # should these be nested as config? > name: fully.qualiied.name > path: /path/to/defining/file > args: [...] > kwargs: {...} > > which would invoke the constructor. > > I'm also open to other ways of naming/structuring these essential > parameters if it makes things more clear. > > - Robert > > > Java: > https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html > Python: > https://beam.apache.org/documentation/transforms/python/aggregation/groupby > Typescript: > https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html > > [1] One can of course use SqlTransform for this, but I'm leaning > towards offering something more native. >