Hi folks,

I've been working on a custom PTransform that makes requests to another
service, and I would like to add a rate-limiting feature to it. The
fundamental issue I'm running into is that I need a decent heuristic to
estimate the worker count, so that each worker can independently set a
local limit that adds up to the right global value. All of this is easy if
I know how many machines I have, but I'd like to use Dataflow's
autoscaling, which would quickly invalidate any pre-configured value.
I have seen two main approaches to rate limiting, both parameterized by a
configurable value x:

   - Simply assume the worker count is x, then divide the global limit by x
   to get each worker's "local" limit (a rough sketch follows this list). The
   issue I have here is that if we assume x is 500 but it is actually 50, I'm
   now paying for 50 nodes that throttle 10 times as much as necessary. I
   know the pipeline options have a reference to the runner; is it possible
   to get an approximate current worker count from that at bundle start
   (*if* the runner is DataflowRunner)?
   - Add another PTransform in front of the API requests, which spreads
   elements over x keys, throttles per key, and keeps forwarding elements
   with an instant trigger (see the second sketch below). I initially really
   liked this solution because even if x is misconfigured, I will have at
   most x throttlers running in parallel, so the global rate stays where I
   want it. However, I noticed that for batch pipelines this effectively
   also caps the API request stage at x workers. If I throw in a
   `Reshuffle`, there is another GroupByKey (-> another stage), and nothing
   gets done until every element has passed through the throttler.
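
To make the first option concrete, here is roughly what I have in mind.
This is only a sketch under my own assumptions: the globalQps and
assumedWorkers parameters and the callExternalService() placeholder are
made up, and the limiter is a plain Guava RateLimiter shared within one
worker JVM, not anything Beam provides.

```java
// Sketch of approach 1: each worker independently limits itself to
// globalQps / assumedWorkers. If assumedWorkers is far above the real
// worker count, the pipeline over-throttles, which is exactly my problem.
import com.google.common.util.concurrent.RateLimiter;
import org.apache.beam.sdk.transforms.DoFn;

public class LocallyThrottledRequestFn extends DoFn<String, String> {

  private final double globalQps;    // desired total requests/sec
  private final int assumedWorkers;  // the configurable x

  // One limiter per worker JVM, shared by all DoFn instances/threads.
  private static RateLimiter limiter;

  public LocallyThrottledRequestFn(double globalQps, int assumedWorkers) {
    this.globalQps = globalQps;
    this.assumedWorkers = assumedWorkers;
  }

  @Setup
  public void setup() {
    synchronized (LocallyThrottledRequestFn.class) {
      if (limiter == null) {
        limiter = RateLimiter.create(globalQps / assumedWorkers);
      }
    }
  }

  @ProcessElement
  public void processElement(@Element String request, OutputReceiver<String> out) {
    limiter.acquire();                        // blocks until a permit is free
    out.output(callExternalService(request)); // placeholder for the real RPC
  }

  private String callExternalService(String request) {
    return request; // stand-in; the real DoFn calls the external service here
  }
}
```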
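
The second option looks roughly like this. Again just a sketch of the
shape, not an existing Beam transform: the random key assignment, numKeys,
and the per-key limit are my own choices. The GroupByKey in the middle is
what introduces the extra stage/barrier in batch mode.

```java
// Sketch of approach 2: spread elements over numKeys keys, group, and
// throttle each key at globalQps / numKeys before forwarding. In batch
// the GroupByKey acts as a barrier and the downstream stage runs with at
// most numKeys-way parallelism.
import java.util.concurrent.ThreadLocalRandom;
import com.google.common.util.concurrent.RateLimiter;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class Throttle extends PTransform<PCollection<String>, PCollection<String>> {

  private final int numKeys;      // the configurable x
  private final double globalQps; // desired total requests/sec

  public Throttle(int numKeys, double globalQps) {
    this.numKeys = numKeys;
    this.globalQps = globalQps;
  }

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input
        .apply("AssignRandomKey", ParDo.of(new DoFn<String, KV<Integer, String>>() {
          @ProcessElement
          public void process(@Element String e, OutputReceiver<KV<Integer, String>> out) {
            out.output(KV.of(ThreadLocalRandom.current().nextInt(numKeys), e));
          }
        }))
        .apply(GroupByKey.<Integer, String>create())  // the extra stage / barrier in batch
        .apply("ThrottlePerKey", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          private transient RateLimiter limiter;

          @Setup
          public void setup() {
            limiter = RateLimiter.create(globalQps / numKeys);
          }

          @ProcessElement
          public void process(@Element KV<Integer, Iterable<String>> kv,
                              OutputReceiver<String> out) {
            for (String element : kv.getValue()) {
              limiter.acquire();  // at most globalQps / numKeys per second per key
              out.output(element);
            }
          }
        }));
  }
}
```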

Has anyone here tried to figure out rate limiting with Beam before, and
perhaps run into similar issues? I would love to know if there is a
preferred solution to this type of problem.
I know sharing state like that runs a little counter to the Beam pipeline
paradigm, but really all I need is an approximate worker count with few
guarantees.

Cheers,
Daniel
