Hi there,

We have a cluster spread over 72 instances on k8s hosting around 12.5
billion documents (made up of 30 collections, each collection having 12
shards). We were originally running 7.7.2, and performance was good enough
for our business needs. We recently upgraded the cluster to v8.11.2 and
have noticed a drop in performance. I appreciate that a lot has changed
between 7.7.2 and 8.11.2, but I have been collecting metrics, and although
the configuration (instance type, resource allocation, startup opts) is
the same, we are completely at a loss as to why it's performing worse.
Does anyone have any guidance?

I recently came across the following tickets, which sparked our interest:

   - SOLR-15840 <https://issues.apache.org/jira/browse/SOLR-15840> -
   Performance degradation with http2
   - SOLR-16099 <https://issues.apache.org/jira/browse/SOLR-16099> - HTTP
   Client threads can hang

We spun up a parallel cluster with -Dsolr.http1=true, but there was no
difference in performance. We're testing a couple of other ideas, such as
a different DirectoryFactory *(as I saw a message from someone in the Solr
Slack about there being an issue with the MMap directory and
vm.max_map_count)* and some GC settings, but we are really open to any
suggestions. We're also happy to use this cluster to test patches at large
scale if that would help with any performance-related work *(in particular
the two Solr tickets listed above)*.
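
For reference, the vm.max_map_count idea can be checked per node with
something like the sketch below. It assumes a Linux host where /proc is
readable and the PID of the Solr JVM is known; the function name
mmap_headroom is only illustrative. It compares the number of mapped
regions listed in /proc/<pid>/maps against the kernel limit.

```python
from pathlib import Path


def mmap_headroom(pid, proc_root="/proc"):
    """Return (mapped_regions, max_map_count, fraction_used) for one process.

    Assumes a Linux-style /proc layout; proc_root is a parameter only so the
    function can be pointed at a different root for testing.
    """
    root = Path(proc_root)
    # Kernel limit on the number of memory-mapped regions a process may hold.
    limit = int((root / "sys" / "vm" / "max_map_count").read_text().strip())
    # Each line of /proc/<pid>/maps describes one mapped region.
    with open(root / str(pid) / "maps") as maps:
        mapped = sum(1 for _ in maps)
    return mapped, limit, mapped / limit
```

Calling mmap_headroom(<solr pid>) on a node returns the region count, the
limit, and the fraction in use; a fraction close to 1.0 would suggest the
limit is worth raising before blaming the directory implementation.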

I thought it would be useful to show some metrics I collected with two
clusters running side by side, one on 7.7.2 and one on 8.11.2. The 8.11.2
cluster was the active one, and all traffic was shadowed to the 7.7.2
cluster for comparison. Both clusters had the same configuration,
including:

   - G1GC garbage collector
   - TLOG replication
   - 27Gi Memory per instance
   - 16Gi assigned to -Xmx and -Xms
   - 16 cores
   - -XX:G1HeapRegionSize=4m
   - -XX:G1ReservePercent=20
   - -XX:InitiatingHeapOccupancyPercent=35

One metric that did stand out was that 8.11.2 was churning through *a lot*
of eden space in the heap, as can be seen in the screenshots of metrics
below:

Total Memory Usage: [screenshot, 7.7.2] [screenshot, 8.11.2]

Total Used G1 Pools: [screenshot, 7.7.2] [screenshot, 8.11.2]

And finally, the overall thread pool: [screenshot, 7.7.2] [screenshot, 8.11.2]

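The eden churn can also be put into rough numbers from two samples of
`jstat -gc <pid>` taken a known number of seconds apart. The sketch below
is only an approximation: the helper names are illustrative, it relies on
the EC, EU and YGC columns (sizes in KB), and it assumes each young
collection emptied an eden about the size of the current EC, which G1 does
not guarantee because it resizes eden dynamically.

```python
def parse_jstat_gc(header_line, value_line):
    """Map `jstat -gc` column names to float values for one sample row."""
    names = header_line.split()
    values = value_line.split()
    # jstat prints "-" for columns that do not apply; skip those.
    return {n: float(v) for n, v in zip(names, values) if v != "-"}


def eden_allocation_rate_kb_s(first, second, seconds):
    """Rough eden allocation rate in KB/s between two parsed samples."""
    young_gcs = second["YGC"] - first["YGC"]
    # Assumption: every young GC in the interval drained roughly EC kilobytes.
    allocated_kb = young_gcs * second["EC"] + (second["EU"] - first["EU"])
    return allocated_kb / seconds
```

Running this against a pod from each cluster under the same shadowed load
would give a like-for-like allocation rate for 7.7.2 and 8.11.2.
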
Any guidance, or requests for things to test performance-wise, would be
appreciated.

Thanks,

Richard
