On Thu, Dec 5, 2024 at 4:03 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>
> Is there any kind of multiprocess parallelism built into the Python SDK
> worker? In other words, is there a way for my runner to start a worker
> and have it use multiple cores instead of having one worker per core?
> I thought I saw this capability somewhere, but now that I look I can
> only see the pipeline option for it but nothing in sdk_worker or
> sdk_worker_main.
The SDK worker does handle multiple bundles in multiple threads, but the
GIL limits the amount of parallelism that can be achieved without using
multiple processes. This is why the Python container also starts up
multiple processes.

> I'm starting to think about how to scale into the 10,000s of running
> workers and I'm concerned that having 10,000 threads on the driver to
> maintain the grpc connections will be a bit much to manage.

One should be able to start only as many workers as there are cores, and
then multiplex any number of threads onto them. (I assume that you're not
running this on a 10,000-core machine...) I don't know if there are
pipeline options for this, but the sibling workers field of the
provisioning API [1] is used to determine the number of workers [2], each
of which then connects to the driver process, which can hand each of them
an unbounded number of bundles.

[1] https://github.com/apache/beam/blob/release-2.61.0/model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_provision_api.proto#L92
[2] https://github.com/apache/beam/blob/release-2.61.0/sdks/python/container/boot.go#L242
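
To make [2] a bit more concrete, here's a rough, untested sketch of what
the runner side could look like: a provision service that hands back N-1
sibling worker ids so a single container boots one SDK worker process per
core. The class name, port handling, and worker id scheme below are made
up for illustration; a real runner would also populate pipeline_options,
retrieval_token, and the logging/artifact endpoints in ProvisionInfo.

    # Rough sketch of a runner-side provision service.  When the SDK
    # container boots, boot.go calls GetProvisionInfo; returning N-1
    # sibling worker ids causes it to start N sdk_worker processes in
    # that container [2].
    from concurrent import futures

    import grpc

    from apache_beam.portability.api import beam_provision_api_pb2
    from apache_beam.portability.api import beam_provision_api_pb2_grpc
    from apache_beam.portability.api import endpoints_pb2


    class PerCoreProvisionService(
        beam_provision_api_pb2_grpc.ProvisionServiceServicer):
      def __init__(self, num_cores, control_url):
        self._num_cores = num_cores
        self._control_url = control_url

      def GetProvisionInfo(self, request, context):
        info = beam_provision_api_pb2.ProvisionInfo(
            control_endpoint=endpoints_pb2.ApiServiceDescriptor(
                url=self._control_url),
            # The booting worker plus num_cores - 1 siblings; the id
            # scheme is arbitrary, it just has to be unique per container.
            sibling_worker_ids=[
                'sibling_%d' % i for i in range(1, self._num_cores)])
        return beam_provision_api_pb2.GetProvisionInfoResponse(info=info)


    def serve(port, num_cores, control_url):
      server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
      beam_provision_api_pb2_grpc.add_ProvisionServiceServicer_to_server(
          PerCoreProvisionService(num_cores, control_url), server)
      server.add_insecure_port('[::]:%d' % port)
      server.start()
      server.wait_for_termination()

Each of those worker processes then opens its own control connection and
multiplexes bundles across its threads, so the driver ends up holding on
the order of num_cores connections per machine rather than one per thread.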