Hi there,

We have a cluster spread over 72 instances on k8s, hosting around 12.5 billion documents (30 collections, each with 12 shards). We were originally running 7.7.2, and performance was good enough for our business needs. We recently upgraded the cluster to 8.11.2 and have noticed a drop in performance. I appreciate that a lot has changed between 7.7.2 and 8.11.2, but I have been collecting metrics, and although the configuration (instance type, resource allocation, startup opts) is the same, we are completely at a loss as to why it is performing worse. Does anyone have any guidance?
I recently stumbled across these tickets:

- SOLR-15840 <https://issues.apache.org/jira/browse/SOLR-15840> - Performance degradation with http2
- SOLR-16099 <https://issues.apache.org/jira/browse/SOLR-16099> - HTTP Client threads can hang

These in particular sparked our interest, so we spun up a parallel cluster with -Dsolr.http1=true, but there was no difference in performance.

We're testing a couple of other ideas, such as a different DirectoryFactory *(I saw a message from someone in the Solr Slack about an issue with the MMap directory and vm.max_map_count)* and some GC settings, but we are really open to any suggestions. If it would help with any performance-related topics, we're also happy to use this cluster to test patches at a large scale *(particularly for the two Solr tickets listed above)*.

I thought it would be useful to show some metrics I collected while we had 2 clusters spun up, one on 7.7.2 and one on 8.11.2. The 8.11.2 cluster was the active one, and all traffic was shadow loaded into the 7.7.2 cluster for comparison. It's important to note that both clusters had the same configuration. To name a few settings:

- G1GC garbage collector
- TLOG replication
- 27Gi memory per instance
- 16Gi assigned to -Xmx and -Xms
- 16 cores
- -XX:G1HeapRegionSize=4m
- -XX:G1ReservePercent=20
- -XX:InitiatingHeapOccupancyPercent=35

One metric that did stand out was that 8.11.2 was churning through *a lot* of eden space in the heap, which can be seen in some of the screenshots of metrics below.

Total Memory Usage: [screenshots: 7.7.2, 8.11.2]

Total Used G1 Pools: [screenshots: 7.7.2, 8.11.2]

And finally, the overall thread pool: [screenshots: 7.7.2, 8.11.2]

Any guidance, or requests for performance tests to run, would be appreciated.

Thanks,
Richard