Re: [YAML] Aggregations

Byron Ellis via dev Wed, 18 Oct 2023 17:06:29 -0700

Is it worth taking a look at similar prior art in the space? The first one
that comes to mind is Transform, but with the dbt labs acquisition that
spec is a lot harder to find. Rill
<https://docs.rilldata.com/example-projects> is pretty similar though.


On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <dev@beam.apache.org>
wrote:

> Beam Yaml has good support for IOs and mappings, but one key missing
> feature for even writing a WordCount is the ability to do Aggregations
> [1]. While the traditional Beam primitive is GroupByKey (and
> CombineValues), we're eschewing KVs in the notion of more schema'd
> data (which has some precedence in our other languages, see the links
> below). The key components the user needs to specify are (1) the key
> fields on which the grouping will take place, (2) the fields
> (expressions?) involved in the aggregation, and (3) what aggregating
> fn to use.
>
> A straw-man example could be something like
>
> type: Aggregating
> config:
>   key: [field1, field2]
>   aggregating:
>     total_cost:
>       fn: sum
>       value: cost
>     max_cost:
>       fn: max
>       value: cost
>
> This would basically correspond to the SQL expression
>
> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> from table GROUP BY field1, field2"
>
> (though I'm not requiring that we use this as an implementation
> strategy). I do not think we need a separate (non aggregating)
> Grouping operation, this can be accomplished by having a concat-style
> combiner.
>
> There are still some open questions here, notably around how to
> specify the aggregation fns themselves. We could of course provide a
> number of built-ins (like SQL does). This gets into the question of
> how and where to document this complete set, but some basics should
> take us pretty far. Many aggregators, however, are parameterized (e.g.
> quantiles); where do we put the parameters? We could go with something
> like
>
> fn:
>   type: ApproximateQuantiles
>   config:
>     n: 10
>
> but others are even configured by functions themselves (e.g. LargestN
> that wants a comparator Fn). Maybe we decide not to support these
> (yet?)
>
> One thing I think we should support, however, is referencing custom
> CombineFns. We have some precedent for this with our Fns from
> MapToFields, where we accept things like inline lambdas and external
> references. Again the topic of how to configure them comes up, as
> these custom Fns are more likely to be parameterized than Map Fns
> (though, to be clear, perhaps it'd be good to allow parameterizatin of
> MapFns as well). Maybe we allow
>
> language: python. # like MapToFields (and here it'd be harder to mix
> and match per Fn)
> fn:
>   type: ???
>   # should these be nested as config?
>   name: fully.qualiied.name
>   path: /path/to/defining/file
>   args: [...]
>   kwargs: {...}
>
> which would invoke the constructor.
>
> I'm also open to other ways of naming/structuring these essential
> parameters if it makes things more clear.
>
> - Robert
>
>
> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>
> [1] One can of course use SqlTransform for this, but I'm leaning
> towards offering something more native.
>

Re: [YAML] Aggregations

Reply via email to