TL;DR: Some limited testing suggests that documentCache adds overhead w/o
benefit.  Are there any folks who can report that their use cases perform
significantly better with documentCache enabled?
Background...
I was asked to investigate a Solr 9.1 Kubernetes deployment where a subset
of the Solr pods (hosting only PULL replicas) was configured to scale
up/down as needed using a Kubernetes HPA (based on CPU load).  The idea
being that, since traffic was largely cyclical over the course of the day,
they could save some compute cost by letting the system scale pods down
when load was lower, and scale back up (still using the same Persistent
Volumes) when load resumed the next day.
For the most part this worked fine, but in some cases -- not all -- they
were seeing that as traffic ramped up, and the HPA (re)started a pod that
had existing replicas on an existing Persistent Volume, the pod might
experience several minutes of p99 query response times that were abysmal:
5+ seconds, compared to the typical 10-100ms.  Even the p50 response
times would be in the 5+ second range for several minutes!
When I started reproducing this behavior under controlled load, what I
found was that when this was happening, thread dumps showed a lot of
threads blocked trying to "read through" the documentCache via
CaffeineCache.computeIfAbsent.
100s of threads:
"qtp392918519-19" #19 prio=5 os_prio=0 cpu=2853.50ms elapsed=129.15s
tid=0x00007f04316f6cf0 nid=0x5e waiting on condition [0x00007f03f8e2f000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
- parking to wait for <0x00000004b1c6a6a8> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at
java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.ReentrantLock$Sync.tryLockNanos(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.locks.ReentrantLock.tryLock(java.base@17.0.7/Unknown
Source)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.lock(BoundedLocalCache.java:1510)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1492)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2212)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:2182)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:316)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:291)
at
org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:209)
at
org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
at
org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
...seemingly blocked on one thread doing cleanup / eviction:
"qtp392918519-960" #960 prio=5 os_prio=0 cpu=141.89ms elapsed=32.24s
tid=0x00007f03f013ec30 nid=0x41f runnable [0x00007f01a946e000]
java.lang.Thread.State: RUNNABLE
at
com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:66)
at
com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:30)
at
com.github.benmanes.caffeine.cache.AbstractLinkedDeque.unlink(AbstractLinkedDeque.java:139)
at
com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictFromWindow(BoundedLocalCache.java:694)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntries(BoundedLocalCache.java:671)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1634)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1602)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3620)
at
org.apache.solr.search.CaffeineCache$$Lambda$656/0x00000008015b2870.execute(Unknown
Source)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1575)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1545)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1477)
at
com.github.benmanes.caffeine.cache.BoundedLocalCache.replace(BoundedLocalCache.java:2485)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache.lambda$handleCompletion$7(LocalAsyncCache.java:216)
at
com.github.benmanes.caffeine.cache.LocalAsyncCache$$Lambda$681/0x00000008015df980.accept(Unknown
Source)
at
java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.CompletableFuture.postComplete(java.base@17.0.7/Unknown
Source)
at
java.util.concurrent.CompletableFuture.complete(java.base@17.0.7/Unknown Source)
at
org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:235)
at
org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
at
org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
(NOTE: not all thread dumps I took included a RUNNABLE thread doing cleanup /
eviction -- it's possible there _was_ a thread doing this cleanup that had
already returned, and the waiting threads hadn't (all) had time to wake up
yet.)
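To make the pattern in those two stack traces concrete, here is a minimal,
self-contained sketch of my own (NOT Solr's code -- the class name, sizes,
key distribution, and thread count are all made up) of many request threads
doing read-through inserts of incomplete futures into a small, bounded
Caffeine AsyncCache.  The putIfAbsent on the asMap() view goes through
Caffeine's write path (afterWrite -> scheduleDrainBuffers), and I've used
executor(Runnable::run) so that any resulting maintenance runs on the
calling thread, which is what the second stack trace suggests was happening
via the executor solr.CaffeineCache supplied:

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

public class ColdCacheContention {
  public static void main(String[] args) throws InterruptedException {
    // Small, bounded async cache -- a stand-in for a documentCache that is
    // much smaller than the set of doc ids being requested.
    AsyncCache<Integer, String> cache = Caffeine.newBuilder()
        .initialCapacity(512)
        .maximumSize(512)
        .executor(Runnable::run) // run maintenance on the calling thread
        .buildAsync();

    Runnable reader = () -> {
      for (int i = 0; i < 100_000; i++) {
        int docId = ThreadLocalRandom.current().nextInt(1_000_000); // mostly misses, like a cold cache
        CompletableFuture<String> ours = new CompletableFuture<>();
        // Inserting the (incomplete) future goes through Caffeine's write
        // path (afterWrite -> scheduleDrainBuffers) -- the lock the parked
        // threads in the first dump were waiting on.
        CompletableFuture<String> theirs = cache.asMap().putIfAbsent(docId, ours);
        if (theirs == null) {
          ours.complete("doc-" + docId); // completion triggers handleCompletion -> replace, as in the second dump
        } else {
          theirs.join(); // someone else is already "loading" this doc
        }
      }
    };

    Thread[] threads = new Thread[64];
    for (int t = 0; t < threads.length; t++) {
      threads[t] = new Thread(reader, "reader-" + t);
      threads[t].start();
    }
    for (Thread t : threads) t.join();
  }
}

Whether this actually reproduces multi-minute stalls will obviously depend
on hardware and sizing, but it at least exercises the same code paths shown
in the dumps above.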
I don't really understand the Caffeine cache internals (or how they are
used in solr.CaffeineCache) enough to fully understand *WHY* this is
happening -- let alone why it seemed to only happen right after pod
startup -- but based on some googling my two best theories are/were:
1) A bug and/or pathological behavior in the Caffeine cache (or in
   solr.CaffeineCache) when the cache starts out "empty"
   - I found some links (see [LINKS] below) that seemed to suggest this
     might be possible, but...
     * We seem to be following most of the best practices
       (initialSize==size, using async, etc.)
     * Since there is no autowarming of the documentCache (and this
       collection had no newSearcher event listener configured), that seems
       like it should impact newSearcher situations just as much as
       firstSearcher
     * But this deployment only ever saw problems on pod startup.
   - I tried testing some changes to mitigate this (sketched in cache terms
     just after this list):
     * increases in size && initialSize
     * explicit firstSearcher warming to pre-fill the cache with *ANY* docs
       (in case the initialSize was being ignored)
     ...but neither changed the outcome
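(For reference, here's roughly what those two mitigations amount to at the
Caffeine level -- a hedged sketch of my own, not the actual
solr.CaffeineCache code.  In Solr itself the equivalents are the
initialSize/size attributes on the documentCache config and a firstSearcher
warming query; the numbers below are arbitrary.)

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;

public class PrefillSketch {
  public static void main(String[] args) {
    int size = 10_240; // hypothetical "bigger" size / initialSize

    // "initialSize == size": size the backing table up front so the cache
    // never has to resize while it is filling.
    AsyncCache<Integer, String> cache = Caffeine.newBuilder()
        .initialCapacity(size)
        .maximumSize(size)
        .buildAsync();

    // Stand-in for explicit firstSearcher warming: populate the cache with
    // *some* entries before query traffic arrives, so early requests are
    // not all cache-miss writes against an empty cache.
    for (int docId = 0; docId < size; docId++) {
      cache.put(docId, CompletableFuture.completedFuture("warmed-doc-" + docId));
    }

    System.out.println("estimated size after pre-fill: " + cache.synchronous().estimatedSize());
  }
}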
2) Reading stored fields from a freshly mounted Persistent Volume (ie:
   nothing in the OS filesystem cache) might be slow enough to cause
   problems / long locking in LocalAsyncCache
   - This occurred to me because of some Caffeine cache guidance I saw
     advising that the Callables passed in to populate the cache should
     be "fast"
     * I was left wondering if the IO needed to read stored fields
       from disk was "fast enough"
     * And it would explain why the problem only seemed to happen
       on pod startup
   - But skimming the code in solr.CaffeineCache, I don't see how this
     would affect us:
     * the only "blocking" operation should be:
       - asyncCache.asMap().putIfAbsent(key, future);
     * actually computing the result of that future shouldn't block anyone
       except threads waiting on that exact key->future (see the sketch
       below)
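To spell out that last bullet, here's a toy example (again mine, not Solr's
-- the key names and the sleep standing in for slow IO are invented) showing
that with the async pattern a slow value computation only stalls callers
waiting on that same key's future; requests for other keys are unaffected:

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;

public class PerKeyBlockingSketch {
  public static void main(String[] args) throws InterruptedException {
    AsyncCache<String, String> cache = Caffeine.newBuilder().maximumSize(16).buildAsync();

    // Insert an *incomplete* future for a "slow" doc -- the putIfAbsent
    // itself returns immediately; the stored-field read is still in flight.
    CompletableFuture<String> slowDoc = new CompletableFuture<>();
    cache.asMap().putIfAbsent("slow-doc", slowDoc);

    // A request for a different doc is not held up by the slow read at all.
    cache.asMap().putIfAbsent("other-doc", CompletableFuture.completedFuture("other-value"));
    System.out.println(cache.asMap().get("other-doc").join()); // returns immediately

    // Only a thread asking for the *same* key ends up waiting on its future.
    Thread waiter = new Thread(() -> System.out.println(cache.asMap().get("slow-doc").join()));
    waiter.start();

    Thread.sleep(1_000); // simulate slow stored-field IO from a cold Persistent Volume
    slowDoc.complete("slow-value"); // the waiter wakes up now
    waiter.join();
  }
}

Which is why, on paper at least, slow stored-field IO for one doc shouldn't
explain hundreds of threads parked on a single ReentrantLock.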
----
In any case: Completely disabling the documentCache made this problem go
away.  Which was not a surprise -- if you stop using codepaths with locks,
you stop seeing a backlog of threads waiting on those locks.
What *WAS* a surprise is that when I disabled the documentCache, there was
a tiny but consistent improvement in the mean, p50, p75, p95 & p99 Solr
response times (as measured by the application executing the HTTP request
to solr and reading the response back over the network).
But this improvement wasn't just in the "obvious" case of the HPA scaling
test (where it eliminated the several minutes of huge outliers).
Even in "steady state" tests, where a fixed number of Solr pods were used,
and received a consistent, and modest, rate of requests that they could
handle w/o maxing out CPU, the measured (mean, p50, p75, p95 & p99) Solr
response times were still consistently a few milliseconds faster when the
documentCache was disabled than they were when the documentCache was
enabled.
In short, for this Solr deployment, the documentCache only seemed to have
downsides:
- Slightly slower overall response times
- Risk of major slowdowns when starting up new pods
Hence the question I posed in the Subject of this email: Does
documentCache still make sense in modern Solr?
Has anyone (else) done any (recent) performance testing of their Solr use
cases with and w/o documentCache enabled, and found that the documentCache
adds value? If not, should we consider changing the default configset to
mark it enabled="false", and let people enable it if and when they think
it might be useful?
[LINKS] related to write contention issues in Caffeine:
- https://github.com/ben-manes/caffeine/wiki/Faq#write-contention
- https://github.com/ben-manes/caffeine/issues/672
- cites Solr as a good example of avoiding these problems
- https://github.com/ben-manes/caffeine/issues/768
- https://github.com/ben-manes/caffeine/issues/527
- https://github.com/ben-manes/caffeine/issues/203
-Hoss
http://www.lucidworks.com/