TL;DR: Some limited testing suggests that documentCache adds overhead w/o
benefit.  Are there any folks who can report that their use cases perform
significantly better with documentCache enabled?
Background...
I was asked to investigate a Solr 9.1 Kubernetes deployment where a subset
of the Solr pods (hosting only PULL replicas) was configured to scale
up/down as needed using a Kubernetes HPA (based on CPU load).  The idea
being that, since traffic was largely cyclical over the course of the day,
they could save some compute cost by letting the system scale pods down
when load was lower, and scale back up (still using the same Persistent
Volumes) when load resumed the next day.
For the most part this worked fine, but in some cases -- not all -- they
were seeing that as traffic ramped up, and the HPA (re)started a pod that
had existing replicas on an existing Persistent Volume, the pod might
experience several minutes of p99 query response times that were abysmal:
5+ seconds, compared to the typical 10-100ms.  Even the p50 response
times would be in the 5+ second range for several minutes!
When I started reproducing this behavior under controlled load, what I
found was that when this was happening, thread dumps showed a lot of
threads blocked trying to "read through" the documentCache via
CaffeineCache.computeIfAbsent.
100s of threads:
"qtp392918519-19" #19 prio=5 os_prio=0 cpu=2853.50ms elapsed=129.15s
tid=0x00007f04316f6cf0 nid=0x5e waiting on condition [0x00007f03f8e2f000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
- parking to wait for <0x00000004b1c6a6a8> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at
java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.ReentrantLock$Sync.tryLockNanos(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.ReentrantLock.tryLock(java.base@17.0.7/Unknown
Source)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.lock(BoundedLocalCache.java:1510)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1492)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2212)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:2182)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:316)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:291)
at
org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:209)
at
org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
at
org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
...seemingly blocked on one thread doing cleanup / eviction:
"qtp392918519-960" #960 prio=5 os_prio=0 cpu=141.89ms elapsed=32.24s
tid=0x00007f03f013ec30 nid=0x41f runnable [0x00007f01a946e000]
java.lang.Thread.State: RUNNABLE
at
com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:66)
at
com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:30)
at
com.github.benmanes.caffeine.cache.AbstractLinkedDeque.unlink(AbstractLinkedDeque.java:139)
at
com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictFromWindow(BoundedLocalCache.java:694)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntries(BoundedLocalCache.java:671)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1634)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1602)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3620)
at
org.apache.solr.search.CaffeineCache$$Lambda$656/0x00000008015b2870.execute(Unknown
Source)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1575)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1545)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1477)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.replace(BoundedLocalCache.java:2485)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache.lambda$handleCompletion$7(LocalAsyncCache.java:216)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache$$Lambda$681/0x00000008015df980.accept(Unknown
Source)
at
java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.CompletableFuture.postComplete(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.CompletableFuture.complete(java.base@17.0.7/Unknown Source)
at
org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:235)
at
org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
at
org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
(NOTE: not all thread dumps I took included a RUNNABLE thread doing cleanup /
eviction -- it's possible there _was_ a thread doing this cleanup that had
already returned, and the waiting threads hadn't (all) had time to wake up
yet.)
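To make the pattern in those two stack traces concrete, here is a minimal,
self-contained sketch of my own (NOT Solr's code -- the class name, sizes,
key distribution, and thread count are all made up) of many request threads
doing read-through inserts of incomplete futures into a small, bounded
Caffeine AsyncCache.  The putIfAbsent on the asMap() view goes through
Caffeine's write path (afterWrite -> scheduleDrainBuffers), and I've used
executor(Runnable::run) so that any resulting maintenance runs on the
calling thread, which is what the second stack trace suggests was happening
via the executor solr.CaffeineCache supplied:

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

public class ColdCacheContention {
  public static void main(String[] args) throws InterruptedException {
    // Small, bounded async cache -- a stand-in for a documentCache that is
    // much smaller than the set of doc ids being requested.
    AsyncCache<Integer, String> cache = Caffeine.newBuilder()
        .initialCapacity(512)
        .maximumSize(512)
        .executor(Runnable::run) // run maintenance on the calling thread
        .buildAsync();

    Runnable reader = () -> {
      for (int i = 0; i < 100_000; i++) {
        int docId = ThreadLocalRandom.current().nextInt(1_000_000); // mostly misses, like a cold cache
        CompletableFuture<String> ours = new CompletableFuture<>();
        // Inserting the (incomplete) future goes through Caffeine's write
        // path (afterWrite -> scheduleDrainBuffers) -- the lock the parked
        // threads in the first dump were waiting on.
        CompletableFuture<String> theirs = cache.asMap().putIfAbsent(docId, ours);
        if (theirs == null) {
          ours.complete("doc-" + docId); // completion triggers handleCompletion -> replace, as in the second dump
        } else {
          theirs.join(); // someone else is already "loading" this doc
        }
      }
    };

    Thread[] threads = new Thread[64];
    for (int t = 0; t < threads.length; t++) {
      threads[t] = new Thread(reader, "reader-" + t);
      threads[t].start();
    }
    for (Thread t : threads) t.join();
  }
}

Whether this actually reproduces multi-minute stalls will obviously depend
on hardware and sizing, but it at least exercises the same code paths shown
in the dumps above.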
I don't really understand the Caffeine cache internals (or how they are
used in solr.CaffeineCache) enough to fully understand *WHY* this is
happening -- let alone why it seemed to only happen right after pod
startup -- but based on some googling my two best theories are/were:
1) A bug and/or pathological behavior in the Caffeine cache (or in
   solr.CaffeineCache) when the cache starts out "empty"
   - I found some links (see [LINKS] below) that seemed to suggest this
     might be possible, but...
     * We seem to be following most of the best practices
       (initialSize==size, using async, etc.)
     * Since there is no autowarming of the documentCache (and this
       collection had no newSearcher event listener configured), that seems
       like it should impact newSearcher situations just as much as
       firstSearcher
     * But this deployment only ever saw problems on pod startup.
   - I tried testing some changes to mitigate this (sketched in cache terms
     just after this list):
     * increases in size && initialSize
     * explicit firstSearcher warming to pre-fill the cache with *ANY* docs
       (in case the initialSize was being ignored)
     ...but neither changed the outcome
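(For reference, here's roughly what those two mitigations amount to at the
Caffeine level -- a hedged sketch of my own, not the actual
solr.CaffeineCache code.  In Solr itself the equivalents are the
initialSize/size attributes on the documentCache config and a firstSearcher
warming query; the numbers below are arbitrary.)

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;

public class PrefillSketch {
  public static void main(String[] args) {
    int size = 10_240; // hypothetical "bigger" size / initialSize

    // "initialSize == size": size the backing table up front so the cache
    // never has to resize while it is filling.
    AsyncCache<Integer, String> cache = Caffeine.newBuilder()
        .initialCapacity(size)
        .maximumSize(size)
        .buildAsync();

    // Stand-in for explicit firstSearcher warming: populate the cache with
    // *some* entries before query traffic arrives, so early requests are
    // not all cache-miss writes against an empty cache.
    for (int docId = 0; docId < size; docId++) {
      cache.put(docId, CompletableFuture.completedFuture("warmed-doc-" + docId));
    }

    System.out.println("estimated size after pre-fill: " + cache.synchronous().estimatedSize());
  }
}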
2) Reading stored fields from a freshly mounted Persistent Volume (ie:
   nothing in the OS filesystem cache) might be slow enough to cause
   problems / long locking in LocalAsyncCache
   - This occurred to me because of some Caffeine cache guidance I saw
     advising that the Callables passed in to populate the cache should
     be "fast"
     * I was left wondering if the IO needed to read stored fields
       from disk was "fast enough"
     * And it would explain why the problem only seemed to happen
       on pod startup
   - But skimming the code in solr.CaffeineCache, I don't see how this
     would affect us:
     * the only "blocking" operation should be:
       - asyncCache.asMap().putIfAbsent(key, future);
     * actually computing the result of that future shouldn't block anyone
       except threads waiting on that exact key->future (see the sketch
       below)
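To spell out that last bullet, here's a toy example (again mine, not Solr's
-- the key names and the sleep standing in for slow IO are invented) showing
that with the async pattern a slow value computation only stalls callers
waiting on that same key's future; requests for other keys are unaffected:

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;

public class PerKeyBlockingSketch {
  public static void main(String[] args) throws InterruptedException {
    AsyncCache<String, String> cache = Caffeine.newBuilder().maximumSize(16).buildAsync();

    // Insert an *incomplete* future for a "slow" doc -- the putIfAbsent
    // itself returns immediately; the stored-field read is still in flight.
    CompletableFuture<String> slowDoc = new CompletableFuture<>();
    cache.asMap().putIfAbsent("slow-doc", slowDoc);

    // A request for a different doc is not held up by the slow read at all.
    cache.asMap().putIfAbsent("other-doc", CompletableFuture.completedFuture("other-value"));
    System.out.println(cache.asMap().get("other-doc").join()); // returns immediately

    // Only a thread asking for the *same* key ends up waiting on its future.
    Thread waiter = new Thread(() -> System.out.println(cache.asMap().get("slow-doc").join()));
    waiter.start();

    Thread.sleep(1_000); // simulate slow stored-field IO from a cold Persistent Volume
    slowDoc.complete("slow-value"); // the waiter wakes up now
    waiter.join();
  }
}

Which is why, on paper at least, slow stored-field IO for one doc shouldn't
explain hundreds of threads parked on a single ReentrantLock.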
----
In any case: Completely disabling the documentCache made this problem go
away.  Which was not a surprise -- if you stop using codepaths with locks,
you stop seeing a backlog of threads waiting on those locks.
What *WAS* a surprise is that when I disabled the documentCache, there was
a tiny but consistent improvement in the mean, p50, p75, p95 & p99 Solr
response times (as measured by the application executing the HTTP request
to solr and reading the response back over the network).
But this improvement wasn't just in the "obvious" case of the HPA scaling
test (where it eliminated the several minutes of huge outliers).
Even in "steady state" tests, where a fixed number of Solr pods were used,
and received a consistent, and modest, rate of requests that they could
handle w/o maxing out CPU, the measured (mean, p50, p75, p95 & p99) Solr
response times were still consistently a few milliseconds faster when the
documentCache was disabled than they were when the documentCache was
enabled.
In short, for this Solr deployment, the documentCache only seemed to have
downsides:
- Slightly slower overall response times
- Risk of major slowdowns when starting up new pods
Hence the question I posed in the Subject of this email: Does
documentCache still make sense in modern Solr?
Has anyone (else) done any (recent) performance testing of their Solr use
cases with and w/o documentCache enabled, and found that the documentCache
adds value? If not, should we consider changing the default configset to
mark it enabled="false", and let people enable it if and when they think
it might be useful?
[LINKS] related to write contention issues in Caffeine:
- https://github.com/ben-manes/caffeine/wiki/Faq#write-contention
- https://github.com/ben-manes/caffeine/issues/672
- cites Solr as a good example of avoiding these problems
- https://github.com/ben-manes/caffeine/issues/768
- https://github.com/ben-manes/caffeine/issues/527
- https://github.com/ben-manes/caffeine/issues/203
-Hoss
http://www.lucidworks.com/