On Fri, Sep 27, 2024 at 8:35 AM Joey Tran <[email protected]> wrote:
> Hey all, > > Just curious if this pattern comes up for others and if people have worked > out a good convention. > > There are many combiners and a lot of them have two forms: a global form > (e.g. Count.Globally) and a per key form (e.g. Count.PerKey). These are > convenient but it feels like often we're running into the case where we > GroupBy a set of data once and then wish to perform a series of combines on > them, in which case neither of these forms work, and it begs another form > which operates on pre-grouped KVs. > > Contrived example: maybe you have a pcollection of keyed numbers and you > want to calculate some summary statistics on them. You could do: > ``` > keyed_means = (keyed_nums > | Mean.PerKey()) > keyed_counts = (keyed_num > | Count.PerKey()) > ... # other combines > ``` > But it'd feel more natural to pre-group the pcollection. > Does it feel more natural because it feels as though it would be more performant? Because it seems like it adds an extra grouping step to the pipeline code, which otherwise might be not necessary. Note that Dataflow has the "combiner lifting" optimization, and combiner-specified-reduction happens before the data is written into shuffle as much as possible: https://cloud.google.com/dataflow/docs/pipeline-lifecycle#combine_optimization . > ``` > grouped_nums = keyed_nums | GBK() > keyed_means = (grouped_nums | Mean.PerGrouped()) > keyed_counts (grouped_nums | Count.PerGrouped()) > ``` > But these "PerGrouped" variants don't actually currently exist. Does > anyone else run into this pattern often? I might be missing an obvious > pattern here. > > -- > > Joey Tran | Staff Developer | AutoDesigner TL > > *he/him* > > [image: Schrödinger, Inc.] <https://schrodinger.com/> >
