Hi folks,

I've been working on a custom PTransform that makes requests to another service, and I'd like to add a rate-limiting feature to it. The fundamental issue I'm running into is that I need a decent heuristic for estimating the worker count, so that each worker can independently set a local limit that adds up to the right global value. All of this is easy if I know how many machines I have, but I'd like to use Dataflow's autoscaling, which would quickly invalidate any pre-configured value. I have seen two main approaches to rate limiting, both built around a configurable value x:
- Simply assume the worker count is x, then divide the global limit by x to get the "local" limit. The issue here is that if I assume x is 500 but it is actually 50, I'm now paying for 50 nodes that throttle ten times harder than necessary. I know the pipeline options hold a reference to the runner; is it possible to get an approximate current worker count from that at bundle start (*if* the runner is DataflowRunner)? (Rough sketch of what I mean at the end of this mail.)

- Add another PTransform in front of the API requests, which groups elements onto x keys, throttles per key, and keeps forwarding elements with an instant trigger. I initially really liked this solution, because even if x is misconfigured I have at most x workers throttling, and the global rate still comes out right. However, I noticed that for batch pipelines this effectively caps the API request stage at x workers as well. If I throw in a `Reshuffle`, there is another GroupByKey (so another stage), and nothing gets done until every element has passed through the throttler. (Also sketched at the end of this mail.)

Has anyone here tried to figure out rate limiting with Beam before, and perhaps run into similar issues? I would love to know if there is a preferred solution to this kind of problem. I know sharing state like that runs a little counter to the Beam pipeline paradigm, but really all I need is an approximate worker count with few guarantees.

Cheers,
Daniel
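P.S. For concreteness, here is a rough sketch of what I have in mind for the first approach. `GLOBAL_QPS`, `ASSUMED_WORKERS`, and `call_external_service` are made-up placeholders, and each worker can run several copies of the DoFn in parallel, so the effective rate is only approximate:

```python
import time

import apache_beam as beam

# Made-up numbers for illustration.
GLOBAL_QPS = 100       # desired total requests/sec across the whole job
ASSUMED_WORKERS = 50   # the pre-configured "x"


class RateLimitedRequestFn(beam.DoFn):
    """Calls the external service, locally capped at GLOBAL_QPS / ASSUMED_WORKERS."""

    def setup(self):
        # Minimum spacing between calls for this DoFn instance.
        self._min_interval = ASSUMED_WORKERS / GLOBAL_QPS
        self._last_call = 0.0

    def process(self, element):
        wait = self._min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)
        self._last_call = time.monotonic()
        # call_external_service is a stand-in for the actual request code.
        yield call_external_service(element)
```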
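And here is roughly what I mean by the throttling transform in the second approach, again with made-up constants (`NUM_SHARDS` is the "x" above):

```python
import random
import time

import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly
from apache_beam.transforms.window import GlobalWindows

# Made-up numbers for illustration.
NUM_SHARDS = 20    # the "x" keys; in batch this also caps the request stage
GLOBAL_QPS = 100


class _ThrottleShardFn(beam.DoFn):
    """Spaces out a shard's elements so it stays under GLOBAL_QPS / NUM_SHARDS."""

    def setup(self):
        self._min_interval = NUM_SHARDS / GLOBAL_QPS
        self._last_emit = 0.0

    def process(self, keyed_values):
        _, values = keyed_values
        for value in values:
            wait = self._min_interval - (time.monotonic() - self._last_emit)
            if wait > 0:
                time.sleep(wait)
            self._last_emit = time.monotonic()
            yield value


class Throttle(beam.PTransform):
    """Shards elements onto NUM_SHARDS keys and throttles each shard independently."""

    def expand(self, pcoll):
        return (
            pcoll
            | 'AssignShard' >> beam.Map(lambda e: (random.randrange(NUM_SHARDS), e))
            # Fire per element so elements keep flowing in streaming mode; in
            # batch the GroupByKey still waits for all of a shard's input.
            | 'FirePerElement' >> beam.WindowInto(
                GlobalWindows(),
                trigger=Repeatedly(AfterCount(1)),
                accumulation_mode=AccumulationMode.DISCARDING)
            | 'GroupShards' >> beam.GroupByKey()
            | 'ThrottleShards' >> beam.ParDo(_ThrottleShardFn())
        )
```

The request stage would then consume the throttler's output, e.g. `elements | Throttle() | beam.ParDo(RateLimitedRequestFn())`. In streaming the per-element trigger keeps elements moving, but in batch the GroupByKey acts as a barrier, which is exactly the problem I described above.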