Re: [DISCUSS] PIP-174: Provide new implementation for broker dispatch cache

Enrico Olivelli Tue, 07 Jun 2022 06:37:04 -0700

Great idea.
I wonder which kind of metrics we could have. To see how each
tenant/namespace is using the cache
Enrico


Il giorno mar 7 giu 2022 alle ore 15:12 PengHui Li
<codelipeng...@gmail.com> ha scritto:
>
> +1
>
> Penghui
> On Jun 7, 2022, 08:48 +0800, Matteo Merli <matteo.me...@gmail.com>, wrote:
> > https://github.com/apache/pulsar/issues/15954
> >
> > WIP can be seen at: https://github.com/apache/pulsar/pull/15955
> >
> > -----------
> >
> >
> > ## Motivation
> >
> > The current implementation of the read cache in the Pulsar broker has 
> > largely
> > remained unchanged for a long time, except for a few minor tweaks.
> >
> > While the implementation is stable and reasonably efficient for
> > typical workloads,
> > the overhead required for managing the cache evictions in a broker
> > that is running
> > many topics can be pretty high in terms of extra CPU utilization and on the 
> > JVM
> > garbage collection to track an increased number of medium-lived objects.
> >
> > The goal is to provide an alternative implementation that can adapt better 
> > to
> > a wider variety of operating conditions.
> >
> > ### Current implementation details
> >
> > The broker cache is implemented as part of the `ManagedLedger` component,
> > which sits in the Pulsar broker and provides a higher level of
> > abstraction of top
> > of BookKeeper.
> >
> > Each topic (and managed-ledger) has its own private cache space. This
> > cache is implemented
> > as a `ConcurrentSkipList` sorted map that maps `(ledgerId, entryId) ->
> > payload`. The payload
> > is a `ByteBuf` reference that can either be a slice of a `ByteBuf` that we 
> > got
> > when reading from a socket, or it can be a copied buffer.
> >
> > Each topic cache is allowed to use the full broker max cache size before an
> > eviction is triggered. The total cache size is effectively a resource
> > shared across all
> > the topics, where a topic can use a more prominent portion of it if it
> > "asks for more".
> >
> > When the eviction happens, we need to do an expensive ranking of all
> > the caches in the broker
> > and do an eviction in a proportional way to the currently used space
> > for each of them.
> >
> > The bigger problem is represented by the `ConcurrentSkipList` and the
> > `ByteBuf` objects
> > that need to be tracked. The skip list is essentially like a "tree"
> > structure and needs to
> > maintain Java objects for each entry in the cache. We also need to
> > potentially have
> > a huge number of ByteBuf objects.
> >
> > A cache workload is typically the worst-case scenario for each garbage
> > collector implementation because it involves creating objects, storing
> > them for some amount of
> > time and then throwing them away. During that time, the GC would have
> > already tenured these
> > objects and copy them into an "old generation" space, and sometime
> > later, a costly compaction
> > of that memory would have to be performed.
> >
> > To mitigate the effect of the cache workload on the GC, we're being
> > very aggressive in
> > purging the cache by triggering time-based eviction. By putting a max
> > TTL on the elements in
> > the cache, we can avoid keeping the objects around for too long to be
> > a problem for the GC.
> >
> > The reverse side of this is that we're artificially reducing the cache
> > capacity to a very
> > short time frame, reducing the cache usefulness.
> >
> > The other problem is the CPU cost involved in doing these frequent
> > evictions, which can
> > be very high when there are 10s of thousands of topics in a broker.
> >
> >
> > ## Proposed changes
> >
> > Instead of dealing with individual caches for each topic, let's adopt
> > a model where
> > there is a single cache space for the broker.
> >
> > This cache is broken into N segments which act as a circular buffer.
> > Whenever a segment
> > is full, we start writing into the next one, and when we reach the
> > last one, we will
> > restart recycling the first segment.
> >
> > Each segment is composed of a buffer, an offset, and a hashmap which maps
> > `(ledgerId, entryId) -> offset`.
> >
> > This model has been working very well for the BookKeeper `ReadCache`:
> > https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/ReadCache.java
> >
> > There are two main advantages to this approach:
> >
> > 1. Entries are copied into the cache buffer (in direct memory), and
> > we don't need to keep any
> > long-lived Java objects around
> > 2. The eviction becomes a completely trivial operation, buffers are
> > just rotated and
> > overwritten. We don't need to do any per-topic task or keep track
> > of utilization.
> >
> > ### API changes
> >
> > No user-facing API changes are required.
> >
> > ### New configuration options
> >
> > The existing cache implementation will not be removed at this point. Users 
> > will
> > be able to configure the old implementation in `broker.conf`.
> >
> > This option will be helpful in case of performance regressions would be 
> > seen for
> > some use cases with the new cache implementation.

Re: [DISCUSS] PIP-174: Provide new implementation for broker dispatch cache

Reply via email to