Re: [YAML] Aggregations

Kenneth Knowles Mon, 30 Oct 2023 13:49:45 -0700

Automatically dereferencing, basically. It is nice. Especially for
many-to-many relationships like the example. I don't know if the
aggregation is any different though, is it?


Kenn

On Sun, Oct 29, 2023 at 1:12 PM Robert Burke <rob...@frantil.com> wrote:

> I came across Edge DB, and it has a novel syntax moving away from SQL with
> their EdgeQL.
>
> https://www.edgedb.com/
>
> E.g., here are two equivalent "nested" queries.
>
>
> # EdgeQL
>
> select Movie {
>   title,
>   actors: {
>    name
>   },
>   rating := math::mean(.reviews.score)
> } filter "Zendaya" in .actors.name;
>
>
> # SQL
>
> SELECT
>   title,
>   Actors.name AS actor_name,
>   (SELECT avg(score)
>     FROM Movie_Reviews
>     WHERE movie_id = Movie.id) AS rating
> FROM
>   Movie
>   LEFT JOIN Movie_Actors ON
>     Movie.id = Movie_Actors.movie_id
>   LEFT JOIN Person AS Actors ON
>     Movie_Actors.person_id = Actors.id
> WHERE
>   'Zendaya' IN (
>     SELECT Person.name
>     FROM
>       Movie_Actors
>       INNER JOIN Person
>         ON Movie_Actors.person_id = Person.id
>     WHERE
>       Movie_Actors.movie_id = Movie.id)
>
>
> The key observation here is that specifics such as join kinds often don't
> need to be expressed directly in the query.
>
> I'd need to dig deeper into it (such as do they share... ) but it does
> make a nice first impression in demos.
>
>
> On Mon, Oct 23, 2023, 7:00 AM XQ Hu via dev <dev@beam.apache.org> wrote:
>
>> +1 on your proposal.
>>
>> On Fri, Oct 20, 2023 at 4:59 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Fri, Oct 20, 2023 at 11:35 AM Kenneth Knowles <k...@apache.org>
>>> wrote:
>>> >
>>> > A couple other bits on having an expression language:
>>> >
>>> >  - You already have Python lambdas at places, right? so that's quite a
>>> lot more complex than SQL project/aggregate expressions
>>> >  - It really does save a lot of pain for users (at the cost of
>>> implementation complexity) when you need to "SUM(col1*col2)" where
>>> otherwise you have to Map first. This could be viewed as desirable as well,
>>> of course.
>>> >
>>> > Anyhow I'm pretty much in agreement with all your reasoning as to why
>>> *not* to use SQL-like expressions in strings. But it does seem odd when
>>> juxtaposed with Python snippets.
>>>
>>> Well, we say "here's a Python expression" when we're using a Python
>>> string. But "SUM(col1*col2)" isn't as transparent. (Agree about the
>>> niceties of being able to provide an expression rather than a column.)
>>>
>>> > On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>> >>
>>> >> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax <re...@google.com> wrote:
>>> >> >
>>> >> > Is the schema Group transform (in Java) something along these lines?
>>> >>
>>> >> Yes, for sure it is. It (and the Python and TypeScript equivalents) is
>>> >> linked in the original post. The open question is how to best express
>>> >> this in YAML.
>>> >>
>>> >> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>> >> >>
>>> >> >> Beam Yaml has good support for IOs and mappings, but one key
>>> missing
>>> >> >> feature for even writing a WordCount is the ability to do
>>> Aggregations
>>> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
>>> >> >> CombineValues), we're eschewing KVs in favor of more schema'd
>>> >> >> data (which has some precedent in our other languages, see the
>>> links
>>> >> >> below). The key components the user needs to specify are (1) the
>>> key
>>> >> >> fields on which the grouping will take place, (2) the fields
>>> >> >> (expressions?) involved in the aggregation, and (3) what
>>> aggregating
>>> >> >> fn to use.
>>> >> >>
>>> >> >> A straw-man example could be something like
>>> >> >>
>>> >> >> type: Aggregating
>>> >> >> config:
>>> >> >>   key: [field1, field2]
>>> >> >>   aggregating:
>>> >> >>     total_cost:
>>> >> >>       fn: sum
>>> >> >>       value: cost
>>> >> >>     max_cost:
>>> >> >>       fn: max
>>> >> >>       value: cost
>>> >> >>
>>> >> >> This would basically correspond to the SQL expression
>>> >> >>
>>> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as
>>> max_cost
>>> >> >> from table GROUP BY field1, field2"
>>> >> >>
>>> >> >> (though I'm not requiring that we use this as an implementation
>>> >> >> strategy). I do not think we need a separate (non aggregating)
>>> >> >> Grouping operation, this can be accomplished by having a
>>> concat-style
>>> >> >> combiner.
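
To make the intended semantics of the straw-man above concrete, here is a minimal plain-Python sketch (no Beam dependency) of what such a config could compute. The `aggregate` helper, the `FNS` lookup table, and the sample rows are hypothetical names introduced only for illustration; this is not how Beam YAML implements the transform.

```python
from collections import defaultdict

# Hypothetical mapping from the straw-man's `fn` names to Python callables.
FNS = {"sum": sum, "max": max}


def aggregate(rows, key, aggregating):
    """Group dict rows by the `key` fields and apply each named aggregation."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)

    results = []
    for key_values, members in groups.items():
        out = dict(zip(key, key_values))
        for name, spec in aggregating.items():
            # Each output field applies one fn to one input column.
            out[name] = FNS[spec["fn"]](m[spec["value"]] for m in members)
        results.append(out)
    return results


# The straw-man config, written as the dict a YAML parser would produce.
config = {
    "key": ["field1", "field2"],
    "aggregating": {
        "total_cost": {"fn": "sum", "value": "cost"},
        "max_cost": {"fn": "max", "value": "cost"},
    },
}

# Made-up sample rows.
rows = [
    {"field1": "a", "field2": "x", "cost": 1},
    {"field1": "a", "field2": "x", "cost": 3},
    {"field1": "b", "field2": "y", "cost": 5},
]

result = aggregate(rows, **config)
```

Each output row carries the key fields plus one field per entry under `aggregating`, which mirrors the `GROUP BY field1, field2` query in the post.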
>>> >> >>
>>> >> >> There are still some open questions here, notably around how to
>>> >> >> specify the aggregation fns themselves. We could of course provide
>>> a
>>> >> >> number of built-ins (like SQL does). This gets into the question of
>>> >> >> how and where to document this complete set, but some basics should
>>> >> >> take us pretty far. Many aggregators, however, are parameterized
>>> (e.g.
>>> >> >> quantiles); where do we put the parameters? We could go with
>>> something
>>> >> >> like
>>> >> >>
>>> >> >> fn:
>>> >> >>   type: ApproximateQuantiles
>>> >> >>   config:
>>> >> >>     n: 10
>>> >> >>
>>> >> >> but others are even configured by functions themselves (e.g.
>>> LargestN
>>> >> >> that wants a comparator Fn). Maybe we decide not to support these
>>> >> >> (yet?)
>>> >> >>
>>> >> >> One thing I think we should support, however, is referencing custom
>>> >> >> CombineFns. We have some precedent for this with our Fns from
>>> >> >> MapToFields, where we accept things like inline lambdas and
>>> external
>>> >> >> references. Again the topic of how to configure them comes up, as
>>> >> >> these custom Fns are more likely to be parameterized than Map Fns
>>> >> >> (though, to be clear, perhaps it'd be good to allow
>>> parameterization of
>>> >> >> MapFns as well). Maybe we allow
>>> >> >>
>>> >> >> language: python  # like MapToFields (and here it'd be harder to mix and match per Fn)
>>> >> >> fn:
>>> >> >>   type: ???
>>> >> >>   # should these be nested as config?
>>> >> >>   name: fully.qualified.name
>>> >> >>   path: /path/to/defining/file
>>> >> >>   args: [...]
>>> >> >>   kwargs: {...}
>>> >> >>
>>> >> >> which would invoke the constructor.
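
As a rough illustration of "invoke the constructor", the sketch below resolves a fully qualified name with `importlib` and calls it with the given `args` and `kwargs`. The `construct` helper is hypothetical, the `path` option from the proposal is not handled, and `fractions.Fraction` from the standard library merely stands in for a user-defined CombineFn class.

```python
import importlib


def construct(name, args=(), kwargs=None):
    """Import `name` (e.g. 'pkg.module.Class') and call it with args/kwargs."""
    module_name, _, attr = name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(*args, **(kwargs or {}))


# Stand-in spec: a real config would name a CombineFn class instead.
spec = {"name": "fractions.Fraction", "args": [3, 4], "kwargs": {}}
obj = construct(spec["name"], spec["args"], spec["kwargs"])
```

Keeping `args` and `kwargs` as plain YAML lists and mappings means only values that YAML can represent directly may be passed this way, which is one reason Fns configured by other functions are harder to support.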
>>> >> >>
>>> >> >> I'm also open to other ways of naming/structuring these essential
>>> >> >> parameters if it makes things more clear.
>>> >> >>
>>> >> >> - Robert
>>> >> >>
>>> >> >>
>>> >> >> Java:
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
>>> >> >> Python:
>>> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
>>> >> >> Typescript:
>>> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>>> >> >>
>>> >> >> [1] One can of course use SqlTransform for this, but I'm leaning
>>> >> >> towards offering something more native.
>>>
>>
