On 10/19/23 19:41, Robert Bradshaw via dev wrote:
On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský <je...@seznam.cz> wrote:
On 10/19/23 18:28, Robert Bradshaw via dev wrote:
On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> wrote:
Rill is definitely SQL-oriented but I think that's going to be the most common. 
Dataframes are explicitly modeled on the relational approach so that's going to 
look a lot like SQL,
I think pretty much any approach that fits here is going to be
relational, meaning you choose a set of columns to group on, a set of
columns to aggregate, and how to aggregate. The big open question is
what syntax to use for the "how."
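
To make the relational shape described above concrete, here is a minimal plain-Python sketch, not Beam code and not a proposed syntax. The helper name `aggregate` and its parameters `group_by` and `aggregations` are hypothetical and chosen only for illustration. The "how" is represented as an ordinary function applied to the grouped values.

```python
from collections import defaultdict


def aggregate(rows, group_by, aggregations):
    """Group dict rows by the `group_by` columns and reduce each group.

    `aggregations` maps an output column name to an (input column, function)
    pair. The function is the "how": it receives the list of grouped values.
    All names here are hypothetical, for illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for out_col, (in_col, fn) in aggregations.items():
            out[out_col] = fn([m[in_col] for m in members])
        result.append(out)
    return result


rows = [
    {"user": "a", "country": "CZ", "amount": 3},
    {"user": "b", "country": "CZ", "amount": 5},
    {"user": "c", "country": "US", "amount": 7},
]

# Group on one column, aggregate one column in two different ways.
totals = aggregate(
    rows,
    group_by=["country"],
    aggregations={"total": ("amount", sum), "largest": ("amount", max)},
)
```

Whatever surface syntax is chosen, it would need to carry the same three pieces of information: the grouping columns, the aggregated columns, and a function per aggregate.
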
This might already be answered; if so, pardon my ignorance, but what
problem is this declarative approach trying to solve? Is it meant to be
more expressive or equally expressive than SQL? And if more, how much more?
I'm not sure if you're asking about YAML in general, or the particular
case of aggregation, but I can answer both.

For the larger Beam YAML project, it's trying to solve the problem
that SQL is (and I'll admit this is somewhat subjective here) good at
expressing the T part of ETL, but not the other parts. For example,
the simple data movement use case of (say) reading from PubSub and
dumping into BigQuery is not well expressed in terms of SQL. SQL is
also fairly awkward when it comes to defining UDFs and TDFs and
non-linear pipelines (especially those with fanout). There are of
course other tools in this space (dbt comes to mind, and there's been
some investigation on how to make dbt play well with Beam). The other
niche it is trying to solve is that installing and learning a full SDK
is heavyweight and overkill for creating pipelines that are simply
wiring together pre-defined transforms.

I think FlinkSQL solves the problem of E and L in SQL via CREATE TABLE and INSERT statements. I agree with the fanout part, though CREATE (TEMPORARY) TABLE AS SELECT ... could possibly solve that as well.

As for the more narrow case of aggregations, I think being similarly
expressive as SQL is fine, though it'd be good to make custom UDAFs
more natural. Originally I was thinking that just having SqlTransform
might be sufficient, but it feels like a big hammer to reach for every
time I just want to sum over one or two columns.

Yes, defining UDFs and UDAFs is painful; that was the motivation for my question. It also determines what the syntax for such a UDAF would need to look like. It would require breaking UDAFs down into several primitive UDFs and then using a functional style to declare them. Most of the time it would probably be sufficient to use simplified CombineFn semantics with the accumulator limited to a primitive type (long, double, string, maybe array?). I suppose declaring a full-blown stateful DoFn (timers, generic state, ...) is out of scope.
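
The idea of breaking a UDAF into a few primitive functions over a primitive accumulator can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper `make_udaf` and its parameter names are hypothetical, they merely mirror the four-step shape of Beam's CombineFn (create an accumulator, add an input, merge accumulators, extract the output), and nothing here is Beam API or a proposed YAML syntax.

```python
from functools import reduce


def make_udaf(create, add_input, merge, extract):
    """Build an aggregate from four primitive functions.

    Hypothetical helper for illustration. The accumulator is whatever
    `create` returns, here restricted by convention to a primitive value.
    """
    def run(partitions):
        # Fold each partition independently, as a distributed runner might.
        partials = [reduce(add_input, part, create()) for part in partitions]
        # Merge the partial accumulators and produce the final value.
        return extract(reduce(merge, partials, create()))
    return run


# Accumulator is a single integer.
sum_of_squares = make_udaf(
    create=lambda: 0,
    add_input=lambda acc, x: acc + x * x,
    merge=lambda a, b: a + b,
    extract=lambda acc: acc,
)

# Accumulator is a single string.
longest = make_udaf(
    create=lambda: "",
    add_input=lambda acc, s: s if len(s) > len(acc) else acc,
    merge=lambda a, b: b if len(b) > len(a) else a,
    extract=lambda acc: acc,
)

squares_result = sum_of_squares([[1, 2], [3], []])
longest_result = longest([["ab", "abc"], ["a"]])
```

Each of the four functions is small enough to be written as a one-line expression, which is what would make a declarative, functional-style declaration plausible. Anything needing timers or generic state does not fit this shape.
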
