Ah! Thanks for that catch. I had subscribed to the user mailing list but forgot to ever subscribe to the dev list.
On Fri, Sep 22, 2023 at 10:03 AM Kenneth Knowles <k...@apache.org> wrote:

> (I notice that you replied only to yourself, but there has been a whole
> thread of discussion on this - are you subscribed to dev@beam?
> https://lists.apache.org/thread/k81fq301ypwmjowknzyqq2qc63844rbd)
>
> It sounds like you want what everyone wants: to have the biggest bundles
> possible.
>
> So for bounded data, basically you make even splits of the data and each
> split is one bundle. And then dynamic splitting to redistribute work to
> eliminate stragglers, if your engine has that capability.
>
> For unbounded data, you more-or-less bundle as much as you can without
> waiting too long, like Jan described.
>
> Users know to put their high fixed costs in @StartBundle, and then it is
> the runner's job to put as many calls to @ProcessElement as possible into
> each bundle to amortize that cost.
>
> Kenn
>
> On Fri, Sep 22, 2023 at 9:39 AM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Whoops, I typoed my last email. I meant to write "this isn't the
>> greatest strategy for high *fixed* cost transforms", e.g. a transform
>> that takes 5 minutes to get set up and then maybe a microsecond per
>> input.
>>
>> I suppose one solution is to move the responsibility for handling this
>> kind of situation to the user and expect users to use a bundling
>> transform (e.g. BatchElements [1]) followed by a Reshuffle+FlatMap. Is
>> this what other runners expect? Just want to make sure I'm not missing
>> some smart generic bundling strategy that might handle this for users.
>>
>> [1]
>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements
>>
>> On Thu, Sep 21, 2023 at 7:23 PM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
>>> I'm writing a runner, and my first strategy for determining bundle
>>> size was to just start with a bundle size of one and double it until we
>>> reach a size that we expect to take some target per-bundle runtime
>>> (e.g. maybe 10 minutes). I realize that this isn't the greatest
>>> strategy for high sized cost transforms. I'm curious what kind of
>>> strategies other runners take?
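For reference, the doubling strategy described in the earliest message above can be sketched in plain Python. This is a minimal illustration, not Beam runner code; `process_bundle`, `choose_bundle_size`, and the measurement loop are all hypothetical names. It also shows why the thread flags the approach for high fixed-cost transforms: the early size-1 and size-2 bundles each pay the full fixed cost before the size ramps up.

```python
import time


def choose_bundle_size(process_bundle, elements, target_runtime_s=600.0):
    """Sketch of the doubling strategy: run bundles of size 1, 2, 4, ...
    and stop growing once a bundle's measured runtime reaches the target
    (e.g. ~10 minutes). Returns the final bundle size reached.

    Caveat from the thread: a transform with a 5-minute fixed setup cost
    pays that cost on every one of the small early bundles, which is why
    amortizing via @StartBundle (or user-side BatchElements) matters.
    """
    size = 1
    i = 0
    while i < len(elements):
        bundle = elements[i:i + size]
        start = time.monotonic()
        process_bundle(bundle)  # runner-side execution of one bundle
        elapsed = time.monotonic() - start
        i += len(bundle)
        if elapsed < target_runtime_s:
            size *= 2  # bundle was cheaper than the target; double it
    return size
```

With a cheap `process_bundle` the size keeps doubling until the input is exhausted, so every element is processed exactly once across the growing bundles.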