+1 - thanks for a great write-up Matteo, it looks like it'll be a great improvement!
(Sorry for casting a late vote, but I figure it's worth affirming the PIP
even if we've already closed the vote.)

Thanks,
Michael

On Sun, Jul 24, 2022 at 9:32 PM Matteo Merli <matteo.me...@gmail.com> wrote:
>
> Closing this vote with 3 +1s and no -1s:
>
> +1s:
>  * Matteo
>  * PengHui
>  * Dave
>
> Thanks,
> Matteo
>
> --
> Matteo Merli
> <matteo.me...@gmail.com>
>
> On Thu, Jul 21, 2022 at 7:58 PM Dave Fisher <wave4d...@comcast.net> wrote:
> >
> > Sorry I’m late to this discussion.
> >
> > I think that the motivation is correct. There is really quite a bit of
> > activity around this issue. Let’s take extra effort and time with
> > commits to confirm the performance improvements.
> >
> > Let’s pay particular attention to threading.
> >
> > +1
> >
> > Regards,
> > Dave
> >
> > Sent from my iPhone
> >
> > > On Jul 21, 2022, at 11:37 AM, Matteo Merli <mme...@apache.org> wrote:
> > >
> > > ## Motivation
> > >
> > > The current implementation of the read cache in the Pulsar broker has
> > > largely remained unchanged for a long time, except for a few minor
> > > tweaks.
> > >
> > > While the implementation is stable and reasonably efficient for
> > > typical workloads, the overhead of managing cache evictions in a
> > > broker that is running many topics can be pretty high, both in extra
> > > CPU utilization and in pressure on the JVM garbage collector, which
> > > has to track an increased number of medium-lived objects.
> > >
> > > The goal is to provide an alternative implementation that can adapt
> > > better to a wider variety of operating conditions.
> > >
> > > ### Current implementation details
> > >
> > > The broker cache is implemented as part of the `ManagedLedger`
> > > component, which sits in the Pulsar broker and provides a higher
> > > level of abstraction on top of BookKeeper.
> > >
> > > Each topic (and managed ledger) has its own private cache space. This
> > > cache is implemented as a `ConcurrentSkipListMap` that maps
> > > `(ledgerId, entryId) -> payload`. The payload is a `ByteBuf` reference
> > > that can either be a slice of a `ByteBuf` that we got when reading
> > > from a socket, or it can be a copied buffer.
> > >
> > > Each topic cache is allowed to use the full broker max cache size
> > > before an eviction is triggered. The total cache size is effectively
> > > a resource shared across all the topics, where a topic can use a
> > > larger portion of it if it "asks for more".
> > >
> > > When eviction happens, we need to do an expensive ranking of all the
> > > caches in the broker and evict from each one in proportion to its
> > > currently used space.
> > >
> > > The bigger problem is represented by the skip list and the `ByteBuf`
> > > objects that need to be tracked. The skip list is essentially a
> > > tree-like structure and needs to maintain a Java object for each
> > > entry in the cache. We also potentially need to keep a huge number of
> > > `ByteBuf` objects.
> > >
> > > A cache workload is typically the worst-case scenario for any garbage
> > > collector implementation because it involves creating objects,
> > > storing them for some amount of time, and then throwing them away.
> > > During that time, the GC will already have tenured these objects and
> > > copied them into an "old generation" space, and sometime later a
> > > costly compaction of that memory has to be performed.
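To make the structure described above concrete, here is a minimal Java sketch of the per-topic cache shape (the names are hypothetical and this is not Pulsar's actual code):

```java
import io.netty.buffer.ByteBuf;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of the per-topic cache shape described in the PIP (hypothetical
// names, not Pulsar's actual code). Every cached entry costs a skip-list
// node plus a ByteBuf reference: the medium-lived objects that stress the
// GC when a broker runs many topics.
class PerTopicCacheSketch {
    // Composite key, ordered by (ledgerId, entryId).
    record EntryKey(long ledgerId, long entryId) implements Comparable<EntryKey> {
        @Override
        public int compareTo(EntryKey o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(entryId, o.entryId);
        }
    }

    private final Map<EntryKey, ByteBuf> entries = new ConcurrentSkipListMap<>();

    void insert(long ledgerId, long entryId, ByteBuf payload) {
        // The payload may be a slice of the buffer read from the socket,
        // or a copied buffer; either way a reference to it is retained.
        entries.put(new EntryKey(ledgerId, entryId), payload.retain());
    }

    ByteBuf get(long ledgerId, long entryId) {
        ByteBuf buf = entries.get(new EntryKey(ledgerId, entryId));
        return buf != null ? buf.retainedDuplicate() : null;
    }
}
```

With tens of thousands of topics, each holding its own map of this shape, the per-entry nodes and buffer references are what drive the GC and eviction costs described in this section.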
> > >
> > > To mitigate the effect of the cache workload on the GC, we're being
> > > very aggressive in purging the cache by triggering time-based
> > > eviction. By putting a max TTL on the elements in the cache, we avoid
> > > keeping objects around long enough to become a problem for the GC.
> > >
> > > The flip side is that we're artificially restricting the cache to a
> > > very short time window, which reduces its usefulness.
> > >
> > > The other problem is the CPU cost of these frequent evictions, which
> > > can be very high when there are tens of thousands of topics in a
> > > broker.
> > >
> > > ## Proposed changes
> > >
> > > Instead of dealing with individual caches for each topic, let's adopt
> > > a model where there is a single cache space for the broker.
> > >
> > > This cache is broken into N segments, which act as a circular buffer.
> > > Whenever a segment is full, we start writing into the next one, and
> > > when we reach the last one, we wrap around and recycle the first
> > > segment.
> > >
> > > This model has been working very well for the BookKeeper `ReadCache`:
> > > https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/ReadCache.java
> > >
> > > Eviction becomes a completely trivial operation: buffers are simply
> > > rotated and overwritten. We don't need to do any per-topic work or
> > > keep track of utilization.
> > >
> > > Today, there are two ways of configuring the cache: one that "copies"
> > > data into the cache, and another that just uses reference counting on
> > > the original buffers to avoid payload copies.
> > >
> > > ### Memory copies into the cache
> > >
> > > Each segment is composed of a buffer, an offset, and a hashmap that
> > > maps `(ledgerId, entryId) -> offset`.
> > >
> > > The advantage of this approach is that entries are copied into the
> > > cache buffer (in direct memory), and we don't need to keep any
> > > long-lived Java objects around.
> > >
> > > ### Keeping reference-counted buffers in the cache
> > >
> > > Each segment in the cache will contain a map
> > > `(ledgerId, entryId) -> ByteBuf`. Buffers will have an increased
> > > reference count that keeps the data alive as long as the buffer is
> > > in the cache; it is released when the cache segment is rotated.
> > >
> > > The advantage is that we avoid any memory copy when inserting into
> > > or reading from the cache. The disadvantage is that we will keep
> > > references to all the `ByteBuf` objects that are in the cache.
> > >
> > > ### API changes
> > >
> > > No user-facing API changes are required.
> > >
> > > ### New configuration options
> > >
> > > The existing cache implementation will not be removed at this point.
> > > Users will be able to configure the old implementation in
> > > `broker.conf`.
> > >
> > > This option will be helpful in case performance regressions are seen
> > > for some use cases with the new cache implementation.
> > >
> > > --
> > > Matteo Merli
> > > <mme...@apache.org>
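For illustration, here is a minimal Java sketch of the copy-based variant described in the PIP above: N fixed-size segments used as a circular buffer, where eviction amounts to advancing to the next segment and clearing its index. All names are hypothetical and the concurrency handling is simplified; the BookKeeper `ReadCache` linked in the thread is the real reference implementation.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the copy-based variant: N fixed-size segments in a ring.
// Eviction is just rotating to the next segment and clearing it.
class SegmentedCacheSketch {
    record EntryKey(long ledgerId, long entryId) {}
    record Slot(int offset, int length) {}

    static class Segment {
        final ByteBuf data;  // one direct-memory buffer per segment
        final ConcurrentHashMap<EntryKey, Slot> index = new ConcurrentHashMap<>();
        int writeOffset = 0;

        Segment(int size) {
            data = PooledByteBufAllocator.DEFAULT.directBuffer(size, size);
        }

        synchronized boolean tryInsert(EntryKey key, ByteBuf entry) {
            int len = entry.readableBytes();
            if (writeOffset + len > data.capacity()) {
                return false;  // segment full: caller rotates to the next one
            }
            // Copy the payload into the segment; no long-lived ByteBuf kept.
            data.setBytes(writeOffset, entry, entry.readerIndex(), len);
            index.put(key, new Slot(writeOffset, len));
            writeOffset += len;
            return true;
        }

        synchronized void clear() {
            index.clear();     // old bytes are simply overwritten later
            writeOffset = 0;
        }
    }

    private final Segment[] segments;
    private int current = 0;

    SegmentedCacheSketch(int numSegments, int segmentSize) {
        segments = new Segment[numSegments];
        for (int i = 0; i < numSegments; i++) {
            segments[i] = new Segment(segmentSize);
        }
    }

    synchronized void put(long ledgerId, long entryId, ByteBuf entry) {
        EntryKey key = new EntryKey(ledgerId, entryId);
        if (!segments[current].tryInsert(key, entry)) {
            current = (current + 1) % segments.length;  // rotation == eviction
            segments[current].clear();
            segments[current].tryInsert(key, entry);    // oversized entries are skipped
        }
    }

    ByteBuf get(long ledgerId, long entryId) {
        EntryKey key = new EntryKey(ledgerId, entryId);
        for (Segment s : segments) {
            Slot slot = s.index.get(key);
            if (slot != null) {
                // A real implementation must also guard against the segment
                // being rotated while a reader still holds this slice.
                return s.data.retainedSlice(slot.offset, slot.length);
            }
        }
        return null;  // cache miss
    }
}
```

With this layout, total cache memory is bounded by `numSegments * segmentSize`, and the eviction cost is independent of both the topic count and the number of cached entries.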
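And a corresponding sketch of the reference-counting variant: a segment retains the original buffers instead of copying their bytes, and the entire eviction cost is releasing them when the segment rotates (again, hypothetical names only).

```java
import io.netty.buffer.ByteBuf;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the reference-counted variant: no memory copies; the segment
// retains the original buffers and releases them all when it is rotated.
class RefCountedSegmentSketch {
    record EntryKey(long ledgerId, long entryId) {}

    private final ConcurrentHashMap<EntryKey, ByteBuf> entries = new ConcurrentHashMap<>();

    void insert(long ledgerId, long entryId, ByteBuf entry) {
        // retain() keeps the data alive while it sits in the cache.
        ByteBuf prev = entries.put(new EntryKey(ledgerId, entryId), entry.retain());
        if (prev != null) {
            prev.release();
        }
    }

    ByteBuf get(long ledgerId, long entryId) {
        ByteBuf buf = entries.get(new EntryKey(ledgerId, entryId));
        // Hand out a retained duplicate so the caller owns its own refcount.
        return buf != null ? buf.retainedDuplicate() : null;
    }

    // Called when the circular buffer rotates onto this segment:
    // releasing every buffer is the entire eviction cost.
    void clear() {
        entries.forEach((k, v) -> v.release());
        entries.clear();
    }
}
```

This preserves the zero-copy read path at the cost of keeping a live `ByteBuf` reference for every entry in the cache, which is exactly the trade-off the PIP describes.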