Hi Girish,

Replies inline. We are getting into a very detailed discussion. We
could also discuss this topic in one of the upcoming Pulsar Community
meetings. However, I might miss the next meeting that is scheduled
this Thursday.
Although I am currently opposing to your proposal PIP-310, I am
supporting solving your problems related to rate limiting. :)
Let's continue the discussion since that is necessary so that we could
make progress. I hope this makes sense from your perspective.

On Sat, 4 Nov 2023 at 17:53, Girish Sharma <scrapmachi...@gmail.com> wrote:
>
> There are challenges in this. As explained in the PIP, there are several
> different usages of rate limiter, stats, unloading, etc. While I am open to
> having a burstable rate limiter in pulsar out of box, it might complicate
> things considering backward compatibility etc. More on this below.
>

I acknowledge that there are different usages, but my assumption is
that we could implement a generic solution that could be configured to
handle each specific use case.
I haven't yet seen any evidence that the requirements in your case are
so special that it justifies adding a pluggable interface for rate
limiters. Exposing yet another pluggable interface in Pulsar will add
complexity without gains. Each supported public interface is a
maintenance burden if we care about the quality of the exposed
interfaces and put effort in ensuring that the interfaces are
supported in future versions. Exposing an interface will also lock
down or slow down some future refactorings.
There will be a need to refactor and improve rate limiters as part of
the flow control and back pressure improvements within the Pulsar
broker. I'd rather keep the rate limiter internal interfaces an
internal implementation detail instead of leaking the details to an
exposed public interface. That's why we should primarily look for a
generic solution. I hope we could put effort in looking into the
characteristics of your requirements and attempt to sketch a design
for a generic solution that could be configured for your purposes.

> > The problems you are describing seem to be common to many Pulsar use cases,
> > and therefore, I think they should be handled directly in Pulsar.
> >
>
> I personally haven't seen many burstability related discussions; so this
> feature might actually not be that useful for all current Pulsar users.

In general there are not many advanced discussions on the mailing list
about the Pulsar internals.
It may also be difficult for others to recognize that specific
behaviors could be resolved by enhancing flow control, back pressure,
and rate limiting/throttling mechanisms.

> I would personally suggest we tackle this problem in parts so that it's
> available incrementally over versions rather than making the scope so big
> that it takes pulsar 4.0 for these features to land.

Sure, quick delivery time is the goal of everyone. Before talking
about schedules, we should be able to discuss the use case and the
design of the desired type of rate limiting and throttling in more
depth.

One concrete example of this is the desired behavior of bursting. In
token bucket rate limiting, bursting is about using the buffered
tokens in the "token bucket" and having a configurable limit for the
buffer (the "bucket"). This buffer will usually only contain tokens
when the actual rate has been lower than the configured maximum rate
for some duration.

However, there could be an expectation for a different type of
bursting which is more like "auto scaling" of the rate limit in a way
where the end-to-end latency of the produced messages
is taken into account. The expected behavior might be about scaling
the rate temporarily to a higher rate so that the queues can be
cleared and that the latency of the messages being sent stay under a
target latency. The current
org.apache.pulsar.broker.service.PublishRateLimiter interface cannot
control aspects that would be needed to handle this type of bursting
where we actually need to scale up the rate limit based on end-to-end
feedback . The PublishRateLimiter interface doesn't have feedback
loops currently.

I don't know what "bursting" means for you. Would it be possible to
provide concrete examples of desired behavior? That would be very
helpful in making progress.

> Yes, we were also thinking on the same terms once this is pluggable. The
> idea was to have some numbers and real world usage backing an
> implementation of rate limiter before merging it back into pulsar. Any
> decision we would take right now would be limited only by theoretical
> discussion of the implementation and our assumption that it covers 99% of
> the use cases, probably just like how the precise and poller ones came into
> being.

This will remain theoretical discussion without concrete examples of
desired behavior or what problem we want to solve.
Initially, we need to have a good grasp on the problem we want to
solve. The solution might change while we experiment and iterate. It
will be possible to improve over time, even after releasing the first
version as part of a Pulsar feature release.
The problem and intention doesn't usually change that much. Assuming
that there's some declarative configuration that configures a rate
limiter, that public "interface" is fairly stable.
Besides the current config of max throughput bytes and messages, there
could be parameters for defining the configuration of handling bursts
and other features. Bursting config could be something like
configuring the max rate for bursting and the duration that the burst
can last for with the maximum rate (there are many ways to express the
size of the "token bucket", another way would be to explicitly define
the size.).

One of the challenges of the existing solution design of the rate
limiters in the Pulsar code base is that it's hard to reason about the
design because of a solution which doesn't align with textbook
algorithms such as the token bucket algorithm.
Basing the design on the token bucket algorithm and making the
concepts in the implementation also match this would be a way to make
it more understandable.
It would be good to hear if others feel this way or not.

> > The current Pulsar rate limiter implementation could be implemented in a
> > cleaner way, which would also be more efficient. Instead of having a
> > scheduler call a method once per second, I think that the rate limiter
> > could be implemented in a reactive way where the algorithm is implemented
> > without a scheduler.
> > I wonder if there are others that would be interested in getting down into
> > such implementation details?
> >
>
> That would be touching RateLimiter.java class, while my goal is to
> improve/touch the outer classes, mainly around the interface
> PublishRateLimiter.java

It's interesting that you mention that you would like to improve the
PublishRateLimiter interface.
How would you change it?

> With this discussion taking a much bigger turn, how or where do we limit
> this PIP's scope? I am happy to work on follow up PIPs which may arrive out
> of this discussion.

We will be able to evaluate that later, after we have made some
conclusions in this discussion.

> > Mostly, the sources are hot sources and can't be slowed down. The lack of
> clear error message (client-server protocol limitation) is also another
> issue that I was planning to tackle in another PIP.

This is an important detail.

> There are a few other custom things. For example, there would be cases of
> short and medium term blacklisting of topics based on breach of rate
> limiter beyond a given SOP. I feel this is very very specific to our
> organization right now to be included inside pulsar itself.

This is outside of rate limiters. IIRC, there have been some
discussions in the community that an API for blocking individual
producers or producing to specific topics could be useful in some
cases.
An external component could observe metrics and control blocking if
there's an API for doing so.

> Actually, by default, and for 99% of the cases, multiplexing isn't an issue
> assuming:
> * A single producer object is producing to a single topic (one or more
> partition)
> * Produce is happening in a round robin manner (by default)

Connection multiplexing issue is also a problem in the cases you
listed since multiple partitions might be served by the same broker
and the connection to that broker could be shared. Producing in a
round-robin manner does not eliminate the issue because the sending
process is asynchronous in the background. Therefore, it is a problem
that should be addressed to get consistent behavior in all cases.
Having a way to isolate connections for each producer / consumer would
be the workaround before solving the producer flow control properly in
the Pulsar binary protocol so that it supports connection
multiplexing.

> Again, I would love this to land in pieces so that TAT for actual usage is
> much faster. What do you suggest from that perspective?

Since we don't backport new features to existing maintenance branches,
the only way to get this into a released version is to target the
master branch so that it gets included in the next feature release.
Our release policy is documented at
https://pulsar.apache.org/contribute/release-policy/ . The goal is to
release from master branch every 3 months.
Once the feature is included, we can implement improvements and bug
fixes in patch releases as needed.

Let's continue in getting into more details of the intended behavior
and define what bursting really means and how we believe that solves a
problem with "hot source" producers.
Makes sense?


-Lari

Reply via email to