Re: [YAML] Aggregations

Jan Lukavský Thu, 19 Oct 2023 10:25:20 -0700

On 10/19/23 18:28, Robert Bradshaw via dev wrote:

On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <[email protected]> wrote:

Rill is definitely SQL-oriented but I think that's going to be the most common. 
Dataframes are explicitly modeled on the relational approach so that's going to 
look a lot like SQL,

I think pretty much any approach that fits here is going to be
relational, meaning you choose a set of columns to group on, a set of
columns to aggregate, and how to aggregate. The big open question is
what syntax to use for the "how."

This might already be answered (if so, pardon my ignorance), but what problem is this declarative approach trying to solve? Is it meant to be more expressive than SQL, or equally expressive? And if more, how much more?


Dataframe aggregation is probably a good example to look at. Here we
have Pandas and R in particular as concrete instances. It should also
be easy to support different aggregations over different (or the same)
columns. Pandas can take a list of (or mapping to) functions in its
groupby().agg(). R doesn't seem to make this very easy...

which leaves us with S-style formulas (which I like but are pretty niche)

I'm curious, what are these?

  and I guess pivot tables coming from the spreadsheet world. Does make me 
wonder what Rails' ORM looks like these days (I last used v4), it had some 
aggregation support and was pretty declarative...

On Wed, Oct 18, 2023 at 6:06 PM Robert Bradshaw <[email protected]> wrote:

On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis <[email protected]> wrote:

Is it worth taking a look at similar prior art in the space?

+1. Pointers welcome.

The first one that comes to mind is Transform, but with the dbt labs 
acquisition that spec is a lot harder to find. Rill is pretty similar though.

Rill seems to be very SQL-based.

On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <[email protected]> 
wrote:

Beam YAML has good support for IOs and mappings, but one key missing
feature for even writing a WordCount is the ability to do Aggregations
[1]. While the traditional Beam primitive is GroupByKey (and
CombineValues), we're eschewing KVs in favor of more schema'd
data (which has some precedent in our other languages, see the links
below). The key components the user needs to specify are (1) the key
fields on which the grouping will take place, (2) the fields
(expressions?) involved in the aggregation, and (3) what aggregating
fn to use.

A straw-man example could be something like

type: Aggregating
config:
  key: [field1, field2]
  aggregating:
    total_cost:
      fn: sum
      value: cost
    max_cost:
      fn: max
      value: cost

This would basically correspond to the SQL expression

"SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
from table GROUP BY field1, field2"

(though I'm not requiring that we use this as an implementation
strategy). I do not think we need a separate (non-aggregating)
Grouping operation; this can be accomplished by having a concat-style
combiner.
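
To make the intended semantics concrete, here is a minimal plain-Python sketch of what the straw-man config above would compute. It is not Beam code and not a proposed implementation: the `aggregate` helper and the `BUILTIN_FNS` registry are hypothetical names used only for illustration, and rows are assumed to be plain dicts.

```python
from collections import defaultdict

# Hypothetical registry of built-in aggregation fns (an assumption for this sketch).
BUILTIN_FNS = {"sum": sum, "max": max, "min": min, "count": len}


def aggregate(rows, key, aggregating):
    """Group dict rows by the key fields, then apply each named aggregation."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)

    results = []
    for key_values, members in groups.items():
        # Start each output row with the grouping fields.
        out = dict(zip(key, key_values))
        for output_name, spec in aggregating.items():
            fn = BUILTIN_FNS[spec["fn"]]
            out[output_name] = fn([m[spec["value"]] for m in members])
        results.append(out)
    return results


# Mirrors the `config` block of the straw-man example.
config = {
    "key": ["field1", "field2"],
    "aggregating": {
        "total_cost": {"fn": "sum", "value": "cost"},
        "max_cost": {"fn": "max", "value": "cost"},
    },
}
rows = [
    {"field1": "a", "field2": "x", "cost": 1},
    {"field1": "a", "field2": "x", "cost": 3},
    {"field1": "b", "field2": "y", "cost": 5},
]
result = aggregate(rows, **config)
```

Each output row carries the key fields plus one field per entry under `aggregating`, which is the same shape the SQL query would return.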

There are still some open questions here, notably around how to
specify the aggregation fns themselves. We could of course provide a
number of built-ins (like SQL does). This gets into the question of
how and where to document this complete set, but some basics should
take us pretty far. Many aggregators, however, are parameterized (e.g.
quantiles); where do we put the parameters? We could go with something
like

fn:
  type: ApproximateQuantiles
  config:
    n: 10

but others are even configured by functions themselves (e.g. LargestN
that wants a comparator Fn). Maybe we decide not to support these
(yet?)
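
One way to picture the two spellings of `fn` (a bare name versus a `type`/`config` mapping) is a small resolver, sketched here in plain Python. The `resolve_fn` helper is hypothetical, and the exact `statistics.quantiles` from the standard library stands in for an approximate quantiles combiner purely for illustration.

```python
import statistics

# Hypothetical built-ins that take no parameters.
SIMPLE_FNS = {"sum": sum, "max": max, "min": min}


def resolve_fn(spec):
    """Turn an `fn` spec (a string or a {type, config} mapping) into a callable."""
    if isinstance(spec, str):
        return SIMPLE_FNS[spec]
    config = spec.get("config", {})
    if spec["type"] == "ApproximateQuantiles":
        # Stand-in: exact quantiles rather than an approximate sketch.
        n = config.get("n", 4)
        return lambda values: statistics.quantiles(values, n=n)
    raise ValueError(f"Unknown aggregation fn: {spec['type']}")


total = resolve_fn("sum")
deciles = resolve_fn({"type": "ApproximateQuantiles", "config": {"n": 10}})
```

Keeping the parameters under `config` leaves the simple string form available for the common unparameterized case.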

One thing I think we should support, however, is referencing custom
CombineFns. We have some precedent for this with our Fns from
MapToFields, where we accept things like inline lambdas and external
references. Again the topic of how to configure them comes up, as
these custom Fns are more likely to be parameterized than Map Fns
(though, to be clear, perhaps it'd be good to allow parameterization of
MapFns as well). Maybe we allow

# like MapToFields (and here it'd be harder to mix and match per Fn)
language: python
fn:
  type: ???
  # should these be nested as config?
  name: fully.qualified.name
  path: /path/to/defining/file
  args: [...]
  kwargs: {...}

which would invoke the constructor.
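
As a rough illustration of "invoking the constructor" from such a spec, the sketch below resolves a fully qualified name with `importlib` and calls it with the given `args` and `kwargs`. The `construct` helper is hypothetical, the `path` key is ignored here, and `fractions.Fraction` is only a stand-in for a custom CombineFn class so the example runs with the standard library alone.

```python
import importlib


def construct(name, args=(), kwargs=None):
    """Import `module.attr` from a dotted name and call it as a constructor."""
    module_name, _, attr = name.rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, attr)
    return cls(*args, **(kwargs or {}))


# Stand-in spec; a real one would name a CombineFn subclass.
spec = {"name": "fractions.Fraction", "args": [3, 4], "kwargs": {}}
obj = construct(spec["name"], spec["args"], spec["kwargs"])
```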

I'm also open to other ways of naming/structuring these essential
parameters if it makes things more clear.

- Robert

Java:
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
Python:
https://beam.apache.org/documentation/transforms/python/aggregation/groupby
Typescript:
https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html

[1] One can of course use SqlTransform for this, but I'm leaning
towards offering something more native.
