Hello all,

The pipeline I'm working on will run against two databases that are both serving live traffic, so I need to control how much data is read/written in order to maintain QoS. To do that, I'd have to limit the number of workers executing in parallel for a given step.
I saw this issue on GitHub: https://github.com/apache/beam/issues/17835. It has been closed, but I don't see any related PRs/comments/etc. Did that work get done, or was it just cancelled?

I also saw this post: https://medium.com/art-of-data-engineering/steady-as-she-flows-rate-limiting-in-apache-beam-pipelines-42cab0b7f31d. That approach would definitely work, but it would leave workers spun up and waiting on locks (which costs money).

Perhaps this is more of a runner concern, though, and I did see a way to limit the maximum number of workers here: https://cloud.google.com/dataflow/docs/reference/pipeline-options. I believe that would also work, but then the maximum worker count would be dictated by whichever step has the highest performance cost, which would make the pipeline slower overall.

Thanks!
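For reference, the worker cap I'm referring to is the Dataflow `max_num_workers` pipeline option; a rough sketch of launching with it (script name, project, and region below are placeholders):

```shell
# Launch a Beam Python pipeline on Dataflow with autoscaling capped.
# my_pipeline.py and my-project are placeholders.
# Note --max_num_workers bounds the whole job, which is exactly the
# drawback above: the most expensive step ends up dictating the cap
# for every other step too.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --max_num_workers=10
```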