Hi Richard, when you mention "SOLR-15840 in particular sparked our interest, so we spun up a parallel cluster with -Dsolr.http1=true, and there was no difference in performance", do you mean that you still see the same performance degradation even with HTTP/1.1 forced?
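If so, it is worth double-checking that the flag was actually picked up on every node before ruling HTTP/2 out. A minimal way to verify, assuming the stock bin/solr scripts and solr.in.sh (adjust for however your k8s image launches Solr):

    # solr.in.sh - make every node start with HTTP/1.1 forced
    SOLR_OPTS="$SOLR_OPTS -Dsolr.http1=true"

    # after a rolling restart, confirm the running JVM really has it
    ps -ef | grep '[s]olr' | grep -c 'Dsolr.http1=true'

Kubernetes images sometimes assemble the start command themselves, so a property set in the manifest may never reach the JVM.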
I will probably state the obvious, but an issue like this normally requires a
detailed, deep investigation. I suspect that without putting our hands on your
cluster/config/architecture it is going to be difficult to give meaningful
suggestions, especially with no reference to what you are currently doing in
Solr. For example, where do you see the degradation, and to what extent?

- indexing? indexing how? indexing what?
- searching? what kind of queries? faceting? reranking? ...

That would definitely help, but I suspect it's not going to be an easy one.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Fri, 2 Dec 2022 at 13:15, Richard Goodman <richa...@brandwatch.com> wrote:

> Hi Charlie,
>
> Gah, thanks for informing me of that; a link to the images is here:
> <https://imgur.com/a/yEmBGuv>
>
> Cheers,
>
> On Tue, 29 Nov 2022 at 13:23, Charlie Hull
> <ch...@opensourceconnections.com> wrote:
>
> > Hey Richard,
> >
> > Attachments are stripped by this list, so you might want to upload them
> > somewhere and link to them.
> >
> > Cheers
> >
> > Charlie
> >
> > On 25/11/2022 17:33, Richard Goodman wrote:
> > > Hi there,
> > >
> > > We have a cluster spread over 72 instances on k8s, hosting around 12.5
> > > billion documents (30 collections, each with 12 shards). We were
> > > originally using 7.7.2 and performance was good enough for our
> > > business needs. We recently upgraded the cluster to v8.11.2 and have
> > > noticed a drop in performance. I appreciate that there have been a lot
> > > of changes from 7.7.2 to 8.11.2, but I have been collecting metrics,
> > > and although the configuration (instance type, resource allocation,
> > > start-up opts) is the same, we are completely at a loss as to why it's
> > > performing worse, and was wondering if anyone had any guidance?
> > >
> > > I recently stumbled across these tickets:
> > >
> > > - SOLR-15840 <https://issues.apache.org/jira/browse/SOLR-15840> -
> > >   Performance degradation with http2
> > > - SOLR-16099 <https://issues.apache.org/jira/browse/SOLR-16099> -
> > >   HTTP Client threads can hang
> > >
> > > SOLR-15840 in particular sparked our interest, so we spun up a
> > > parallel cluster with -Dsolr.http1=true, and there was no difference
> > > in performance. We're testing a couple of other ideas, such as a
> > > different DirectoryFactory *(as I saw a message from someone in the
> > > Solr Slack about there being an issue with the MMap directory and
> > > vm.max_map_count)* and some GC settings, but are really open to any
> > > suggestions. If it would help with any performance-related topics, we
> > > are also happy to use this cluster to test patches at a large scale
> > > *(more specifically for the two Solr tickets listed above)*.
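A side note on the MMap point before you swap the DirectoryFactory: it is
worth confirming whether the Solr process is actually approaching the mapping
ceiling at all. A quick check, assuming Linux nodes (the 262144 below is only
the commonly suggested ceiling, not a recommendation):

    # current kernel limit on memory mappings per process
    sysctl vm.max_map_count

    # rough count of mappings the Solr JVM currently holds
    wc -l /proc/$(pgrep -f solr | head -1)/maps

    # raise it if needed (persist via /etc/sysctl.d/ or your node config)
    sysctl -w vm.max_map_count=262144

If the mapping count sits nowhere near the limit, MMapDirectory is unlikely
to be the culprit.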
> > > I thought it would be useful to show some metrics I collected when we
> > > had two clusters spun up, one running 7.7.2 and one running 8.11.2,
> > > where the 8.11.2 cluster was the active one and all traffic was
> > > shadow-loaded into the 7.7.2 cluster for comparison. It's important to
> > > note that both clusters had the same configuration; to name a few
> > > settings:
> > >
> > > - G1GC garbage collector
> > > - TLOG replication
> > > - 27Gi memory per instance
> > > - 16Gi assigned to -Xmx and -Xms
> > > - 16 cores
> > > - -XX:G1HeapRegionSize=4m
> > > - -XX:G1ReservePercent=20
> > > - -XX:InitiatingHeapOccupancyPercent=35
> > >
> > > One metric that did stand out was that 8.11.2 was churning through *a
> > > lot* of eden space in the heap, which can be seen in the screenshots
> > > (stripped by the list, now at <https://imgur.com/a/yEmBGuv>), showing
> > > 7.7.2 vs 8.11.2 for each of:
> > >
> > > - Total Memory Usage
> > > - Total Used G1 Pools
> > > - the overall thread pool
> > >
> > > Any guidance, or requests to test anything performance-wise, would be
> > > appreciated.
> > >
> > > Thanks,
> > >
> > > Richard
> >
> > --
> > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > Founding member of The Search Network <http://www.thesearchnetwork.com>
> > and co-author of Searching the Enterprise
> > <https://opensourceconnections.com/wp-content/uploads/2020/08/ES_book_final_journal_version.pdf>
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
>
> --
> Richard Goodman (he/him) | Senior Data Infrastructure engineer
> richa...@brandwatch.com
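One more practical step on the eden churn you highlighted: GC logs from both
clusters would let you compare allocation rates directly rather than
eyeballing the pool graphs. A minimal sketch, assuming JDK 11 unified logging
(the file path is illustrative; the flag goes alongside your existing G1
settings, e.g. in GC_TUNE in solr.in.sh):

    -Xlog:gc*,gc+age=trace,safepoint:file=/var/solr/logs/gc.log:time,uptime:filecount=9,filesize=20M

Note that G1 sizes eden dynamically, so a larger eden pool on 8.11.2 is not
by itself a regression; what matters is whether the allocation rate per
request has actually gone up.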