Hello all,

The pipeline I'm working on will be run against two different databases
that are both online, so I will have to throttle the amount of data being
read/written to maintain QoS. To do so, I would need to control the number
of workers executing a given step in parallel.

I saw this issue on GitHub: https://github.com/apache/beam/issues/17835.
However, it has been closed and I don't see any related PRs or comments.
Was that work completed, or was it abandoned?

I also saw this post:
https://medium.com/art-of-data-engineering/steady-as-she-flows-rate-limiting-in-apache-beam-pipelines-42cab0b7f31d.
That approach would definitely work, but it would result in workers
spinning up just to wait on locks (which costs money).
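To make the concern concrete, here is a minimal stdlib-only sketch of the
kind of shared-lock rate limiting that post describes (a token bucket; the
class name and parameters are my own, not from the article). Each worker
thread would call acquire() before a database read/write, and the sleep in
the loop is exactly where a provisioned worker sits idle while still being
billed:

```python
import threading
import time

class TokenBucket:
    """Token-bucket rate limiter shared by workers; acquire() blocks
    until a token is available, so callers wait rather than exceed QoS."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                # Refill proportionally to elapsed time, capped at capacity.
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            # Worker is alive (and paid for) but doing nothing here.
            time.sleep(1.0 / self.rate)
```

In a Beam DoFn you would presumably call something like
bucket.acquire() at the top of process() before each external call; the
throttling works, but every blocked call keeps a worker occupied.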

Perhaps this is more of a runner concern, though, and I did see a way to
limit the maximum number of workers here:
https://cloud.google.com/dataflow/docs/reference/pipeline-options. I
believe that would also work, but then the maximum number of workers would
be dictated by whichever step has the highest performance cost, which
would make the pipeline slower overall.
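For reference, this is the option I mean, set at launch time for the whole
job (the script name, project, and region below are placeholders):

```shell
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --max_num_workers=10
```

As far as I can tell this cap is pipeline-wide, which is exactly why the
slowest/most expensive step ends up dictating the limit for every other
step.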

Thanks!
