Ah! Thanks for that catch. I had subscribed to the user mailing list but
forgot to ever subscribe to the dev list.
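
For anyone skimming the archive later, here is a plain-Python sketch of the
amortization contract Kenn describes below. This is not real Beam code; the
DoFn-like class and the toy runner loop are stand-ins meant only to show why
larger bundles amortize a high fixed cost:

```python
class ExpensiveDoFn:
    """Stand-in DoFn: high fixed cost in start_bundle, cheap per-element work."""

    def __init__(self):
        self.setup_calls = 0

    def start_bundle(self):
        # Imagine this takes ~5 minutes (load a model, open a connection, ...).
        self.setup_calls += 1

    def process(self, element):
        # ...and this takes microseconds per element.
        return element * 2


def run_bundles(fn, elements, bundle_size):
    """Toy runner loop: one start_bundle per bundle, then many process calls."""
    out = []
    for i in range(0, len(elements), bundle_size):
        fn.start_bundle()
        out.extend(fn.process(e) for e in elements[i:i + bundle_size])
    return out
```

With 1000 elements and a bundle size of 1000, the fixed cost is paid once;
with a bundle size of 1, it is paid 1000 times, which is why runners want the
biggest bundles possible.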

On Fri, Sep 22, 2023 at 10:03 AM Kenneth Knowles <k...@apache.org> wrote:

> (I notice that you replied only to yourself, but there has been a whole
> thread of discussion on this - are you subscribed to dev@beam?
> https://lists.apache.org/thread/k81fq301ypwmjowknzyqq2qc63844rbd)
>
> It sounds like you want what everyone wants: to have the biggest bundles
> possible.
>
> So for bounded data, you basically make even splits of the data, and each
> split is one bundle. Then use dynamic splitting to redistribute work and
> eliminate stragglers, if your engine has that capability.
>
> For unbounded data, you more-or-less bundle as much as you can without
> waiting too long, like Jan described.
>
> Users know to put their high fixed costs in @StartBundle, and then it is
> the runner's job to pack as many calls to @ProcessElement as possible into
> each bundle to amortize that cost.
>
> Kenn
>
> On Fri, Sep 22, 2023 at 9:39 AM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Whoops, I typoed my last email. I meant to write "this isn't the
>> greatest strategy for high *fixed* cost transforms", e.g. a transform
>> that takes 5 minutes to get set up and then maybe a microsecond per input
>>
>> I suppose one solution is to move the responsibility for handling this
>> kind of situation to the user and expect users to use a bundling transform
>> (e.g. BatchElements [1]) followed by a Reshuffle+FlatMap. Is this what
>> other runners expect? Just want to make sure I'm not missing some smart
>> generic bundling strategy that might handle this for users.
>>
>> [1]
>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements
>>
>>
>> On Thu, Sep 21, 2023 at 7:23 PM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
I'm writing a runner, and my first strategy for determining bundle size
>>> was to start with a bundle size of one and double it until we reach a
>>> size that we expect to hit some target per-bundle runtime (e.g. maybe 10
>>> minutes). I realize that this isn't the greatest strategy for high sized
>>> cost transforms. I'm curious what kinds of strategies other runners take?
>>>
>>
