+1 on your proposal.

On Fri, Oct 20, 2023 at 4:59 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> On Fri, Oct 20, 2023 at 11:35 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > A couple other bits on having an expression language:
> >
> >  - You already have Python lambdas at places, right? So that's quite a
> >    lot more complex than SQL project/aggregate expressions.
> >  - It really does save a lot of pain for users (at the cost of
> >    implementation complexity) when you need to "SUM(col1*col2)" where
> >    otherwise you have to Map first. This could be viewed as desirable
> >    as well, of course.
> >
> > Anyhow I'm pretty much in agreement with all your reasoning as to why
> > *not* to use SQL-like expressions in strings. But it does seem odd when
> > juxtaposed with Python snippets.
>
> Well, we say "here's a Python expression" when we're using a Python
> string. But "SUM(col1*col2)" isn't as transparent. (Agree about the
> niceties of being able to provide an expression rather than a column.)
>
> > On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >>
> >> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax <re...@google.com> wrote:
> >> >
> >> > Is the schema Group transform (in Java) something along these lines?
> >>
> >> Yes, for sure it is. It (and Python's and Typescript's equivalents) are
> >> linked in the original post. The open question is how to best express
> >> this in YAML.
> >>
> >> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >> >>
> >> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> >> feature for even writing a WordCount is the ability to do Aggregations
> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> >> CombineValues), we're eschewing KVs in favor of more schema'd
> >> >> data (which has some precedent in our other languages, see the links
> >> >> below). The key components the user needs to specify are (1) the key
> >> >> fields on which the grouping will take place, (2) the fields
> >> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> >> fn to use.
> >> >>
> >> >> A straw-man example could be something like
> >> >>
> >> >> type: Aggregating
> >> >> config:
> >> >>   key: [field1, field2]
> >> >>   aggregating:
> >> >>     total_cost:
> >> >>       fn: sum
> >> >>       value: cost
> >> >>     max_cost:
> >> >>       fn: max
> >> >>       value: cost
> >> >>
> >> >> This would basically correspond to the SQL expression
> >> >>
> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> >> from table GROUP BY field1, field2"
> >> >>
> >> >> (though I'm not requiring that we use this as an implementation
> >> >> strategy). I do not think we need a separate (non aggregating)
> >> >> Grouping operation, this can be accomplished by having a concat-style
> >> >> combiner.
> >> >>
> >> >> There are still some open questions here, notably around how to
> >> >> specify the aggregation fns themselves. We could of course provide a
> >> >> number of built-ins (like SQL does). This gets into the question of
> >> >> how and where to document this complete set, but some basics should
> >> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> >> quantiles); where do we put the parameters? We could go with something
> >> >> like
> >> >>
> >> >> fn:
> >> >>   type: ApproximateQuantiles
> >> >>   config:
> >> >>     n: 10
> >> >>
> >> >> but others are even configured by functions themselves (e.g. LargestN
> >> >> that wants a comparator Fn). Maybe we decide not to support these
> >> >> (yet?)
> >> >>
> >> >> One thing I think we should support, however, is referencing custom
> >> >> CombineFns. We have some precedent for this with our Fns from
> >> >> MapToFields, where we accept things like inline lambdas and external
> >> >> references. Again the topic of how to configure them comes up, as
> >> >> these custom Fns are more likely to be parameterized than Map Fns
> >> >> (though, to be clear, perhaps it'd be good to allow parameterization of
> >> >> MapFns as well). Maybe we allow
> >> >>
> >> >> language: python  # like MapToFields (and here it'd be harder to mix
> >> >>                   # and match per Fn)
> >> >> fn:
> >> >>   type: ???
> >> >>   # should these be nested as config?
> >> >>   name: fully.qualified.name
> >> >>   path: /path/to/defining/file
> >> >>   args: [...]
> >> >>   kwargs: {...}
> >> >>
> >> >> which would invoke the constructor.
> >> >>
> >> >> I'm also open to other ways of naming/structuring these essential
> >> >> parameters if it makes things more clear.
> >> >>
> >> >> - Robert
> >> >>
> >> >>
> >> >> Java: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> >> Python: https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> >> Typescript: https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >> >>
> >> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> >> towards offering something more native.
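
For concreteness, here is a rough sketch in plain Python of what the straw-man Aggregating config quoted above would compute, i.e. the same result as the SELECT ... GROUP BY statement. This is only meant to pin down the semantics: the aggregate helper, the BUILTIN_FNS table, and the sample rows are all hypothetical names made up for illustration, and none of this is Beam API or a suggested implementation strategy.

```python
from collections import defaultdict

# Hypothetical table mapping the built-in fn names from the straw-man
# config to plain Python callables (an assumption, not a Beam registry).
BUILTIN_FNS = {"sum": sum, "max": max, "min": min}


def aggregate(rows, key, aggregating):
    """Group dict rows by the key fields and apply each named aggregation."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)

    results = []
    for key_values, members in groups.items():
        # Start each output row with the grouping fields themselves.
        out = dict(zip(key, key_values))
        for out_field, spec in aggregating.items():
            fn = BUILTIN_FNS[spec["fn"]]
            out[out_field] = fn(m[spec["value"]] for m in members)
        results.append(out)
    return results


# The "config" block of the straw-man YAML, written as a Python dict.
config = {
    "key": ["field1", "field2"],
    "aggregating": {
        "total_cost": {"fn": "sum", "value": "cost"},
        "max_cost": {"fn": "max", "value": "cost"},
    },
}

# Made-up sample input rows.
rows = [
    {"field1": "a", "field2": 1, "cost": 2.0},
    {"field1": "a", "field2": 1, "cost": 5.0},
    {"field1": "b", "field2": 2, "cost": 3.0},
]

result = aggregate(rows, **config)
```

With these sample rows, the ("a", 1) group gets total_cost 7.0 and max_cost 5.0, and the ("b", 2) group gets 3.0 for both. A parameterized or custom fn would need something richer than a name-to-callable table, which is exactly the open question in the thread.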