On Thu, Dec 5, 2024 at 4:03 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>
> Is there any kind of multiprocess parallelism built into the Python sdk 
> worker? In other words, is there a way for my runner to start a worker and 
> have it use multiple cores instead of having one worker per core? I thought I 
> saw this capability somewhere but now that I look I can only see the pipeline 
> option for it but nothing in sdk_worker or sdk_worker_main.

The SDK worker does handle multiple bundles concurrently on multiple
threads, but the GIL limits how much CPU parallelism a single Python
process can achieve. This is why the Python container also starts up
multiple worker processes.
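
To illustrate (plain Python, nothing Beam-specific): CPU-bound work fanned out
to threads stays serialized by the GIL, which is why the fan-out has to happen
at the process level instead. A minimal sketch:

```python
# Illustration only (not Beam code): the GIL serializes CPU-bound
# Python bytecode across threads, so real CPU parallelism requires
# multiple processes -- which is what the Python container does.
import concurrent.futures
import os

def cpu_bound(n):
    # A pure-Python loop; threads running this contend on the GIL.
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    work = [200_000] * 4

    # Threads share one interpreter and one GIL: correct results,
    # but little to no CPU speedup for this kind of work.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        thread_results = list(pool.map(cpu_bound, work))

    # Separate processes each have their own interpreter and GIL,
    # so they can run on separate cores in parallel.
    workers = os.cpu_count() or 2
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        process_results = list(pool.map(cpu_bound, work))

    assert thread_results == process_results
```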

> I'm starting to think about how to scale into the 10,000s of running workers 
> and I'm concerned that having 10,000 threads on the driver to maintain the 
> grpc connections will be a bit much to manage.

One should be able to start only as many workers as there are cores,
and then multiplex any number of threads onto them. (I assume that
you're not running this on a 10,000-core machine...)
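
That multiplexing idea, as a rough sketch (hypothetical names, not actual SDK
harness code): a fixed pool sized to the core count services an arbitrarily
large stream of bundles, so the connection count stays bounded.

```python
# Sketch: many bundles multiplexed onto a small fixed worker pool,
# rather than one live worker (and one gRPC connection) per bundle.
# "process_bundle" is a stand-in for real bundle processing.
import concurrent.futures
import os

def process_bundle(bundle_id):
    return bundle_id * 2  # placeholder work

num_workers = os.cpu_count() or 4  # one worker per core

with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
    results = list(pool.map(process_bundle, range(10_000)))

# 10,000 bundles processed, but never more than num_workers in flight.
assert len(results) == 10_000
```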

I don't know if there are pipeline options for this, but the sibling
workers field of the provisioning API [1] is used to determine the
number of worker processes [2], each of which then connects to the
driver process, which can give them an unbounded number of bundles.

[1] 
https://github.com/apache/beam/blob/release-2.61.0/model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_provision_api.proto#L92
[2] 
https://github.com/apache/beam/blob/release-2.61.0/sdks/python/container/boot.go#L242
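
The boot-time decision described above could be sketched roughly like this
(the function and dict key are illustrative only; the real field is defined in
beam_provision_api.proto [1] and read by boot.go [2]):

```python
# Hedged sketch of the container boot logic: decide how many SDK
# worker processes to start from the provisioning response, falling
# back to one per core. "sibling_worker_count" is an illustrative
# name, not the actual proto field.
import os

def decide_worker_count(provision_info):
    # Prefer an explicit runner-provided count; otherwise start one
    # worker process per available core.
    requested = provision_info.get("sibling_worker_count", 0)
    return requested if requested > 0 else (os.cpu_count() or 1)
```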
