Re: [YAML] Aggregations

Robert Bradshaw via dev Thu, 19 Oct 2023 12:13:36 -0700

On Thu, Oct 19, 2023 at 11:12 AM Kenneth Knowles <[email protected]> wrote:
>
> Using SQL expressions in strings is maybe OK given we are all
> relational all the time. Either way you have to define what the
> universe of `fn` is. Here's a compact possibility:
>
> type: Combine
> config:
>   group_by: [field1, field2]
>   aggregates:
>     max_cost: "MAX(cost)"
>     total_cost: "SUM(cost)"
>
> Just a thought to get it closer to SQL concision.


So I'm a bit wary of having to parse these strings (unless a language
parameter is passed, in which case we defer to that language's syntax).
It's also messy the other way around: if a tool generates the YAML, I'd
rather it not have to generate strings like this (i.e. string literals
should either be identifiers or opaque blobs).
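
To make the parsing concern concrete, here is a rough Python sketch of the two shapes after the YAML has been loaded into dicts. Everything in it is an assumption for illustration: `parse_aggregate`, the regex grammar, and the `value` key are hypothetical and are not Beam's implementation (`fn` is borrowed from the quoted message).

```python
import re

# Hypothetical grammar: accepts only "FN(field)". This is an assumption
# about a minimal parser, not anything Beam actually ships.
_AGG_PATTERN = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(\s*([A-Za-z_]\w*)\s*\)\s*$")


def parse_aggregate(expr):
    """Split a string such as "MAX(cost)" into a (fn, field) pair."""
    match = _AGG_PATTERN.match(expr)
    if match is None:
        raise ValueError(f"unsupported aggregate expression: {expr!r}")
    fn, field = match.groups()
    return fn.lower(), field


# String form, as in the quoted proposal: it has to be parsed before use,
# and a generating tool has to emit well-formed expressions.
string_spec = {"max_cost": "MAX(cost)", "total_cost": "SUM(cost)"}
parsed = {name: parse_aggregate(expr) for name, expr in string_spec.items()}

# Structured form (hypothetical key names): every string is a plain
# identifier, so nothing needs parsing or escaping.
structured_spec = {
    "max_cost": {"fn": "max", "value": "cost"},
    "total_cost": {"fn": "sum", "value": "cost"},
}
```

Even this tiny grammar rejects something like "SUM(cost * 2)", so the set of accepted expressions has to be defined somewhere, either by us or by the language we defer to.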

Pandas achieves concision by allowing one to just specify, say,
'sum' and implicitly summing over all (numeric) fields, and allowing
more verbose, precise specification as well.
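
For reference, a small sketch of those two pandas styles. The column names and data are made up for illustration; only the pandas calls themselves (`groupby`, `sum`, and named aggregation via `agg`) are real API.

```python
import pandas as pd

# Made-up sample data; field1/field2/cost echo the quoted YAML example.
df = pd.DataFrame({
    "field1": ["a", "a", "b"],
    "field2": ["x", "x", "y"],
    "cost": [1.0, 2.0, 5.0],
    "quantity": [1, 3, 2],
})

# Concise form: name only the aggregation; every remaining numeric
# column (cost, quantity) is summed within each group.
concise = df.groupby(["field1", "field2"]).sum(numeric_only=True)

# Precise form: named aggregation, one (column, function) pair per
# output column, which lines up with max_cost/total_cost above.
precise = df.groupby(["field1", "field2"]).agg(
    max_cost=("cost", "max"),
    total_cost=("cost", "sum"),
)
```

In the precise form every string is an identifier (a column name or a function name), so nothing needs to be parsed.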

> I also used the word
> "Combine" just to connect it to other Beam writings and whatnot.

+1

> On Thu, Oct 19, 2023 at 1:41 PM Robert Bradshaw via dev
> <[email protected]> wrote:
> >
> > On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský <[email protected]> wrote:
> > >
> > > On 10/19/23 18:28, Robert Bradshaw via dev wrote:
> > > > > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <[email protected]> wrote:
> > > >> Rill is definitely SQL-oriented but I think that's going to be the 
> > > >> most common. Dataframes are explicitly modeled on the relational 
> > > >> approach so that's going to look a lot like SQL,
> > > > I think pretty much any approach that fits here is going to be
> > > > relational, meaning you choose a set of columns to group on, a set of
> > > > columns to aggregate, and how to aggregate. The big open question is
> > > > what syntax to use for the "how."
> > > This might be already answered, if so, pardon my ignorance, but what is
> > > the goal this declarative approach is trying to solve? Is it meant to be
> > > > more expressive or equally expressive than SQL? And if more, how much more?
> >
> > I'm not sure if you're asking about YAML in general, or the particular
> > case of aggregation, but I can answer both.
> >
> > For the larger Beam YAML project, it's trying to solve the problem
> > that SQL is (and I'll admit this is somewhat subjective here) good at
> > expressing the T part of ETL, but not the other parts. For example,
> > the simple data movement use case of (say) reading from PubSub and
> > dumping into BigQuery is not well expressed in terms of SQL. SQL is
> > also fairly awkward when it comes to defining UDFs and TDFs and
> > non-linear pipelines (especially those with fanout). There are of
> > course other tools in this space (dbt comes to mind, and there's been
> > some investigation on how to make dbt play well with Beam). The other
> > niche it is trying to solve is that installing and learning a full SDK
> > is heavyweight and overkill for creating pipelines that are simply
> > wiring together pre-defined transforms.
> >
> > As for the narrower case of aggregations, I think being as
> > expressive as SQL is fine, though it'd be good to make custom UDAFs
> > more natural. Originally I was thinking that just having SqlTransform
> > might be sufficient, but it feels like a big hammer to reach for every
> > time I just want to sum over one or two columns.
