We recently upgraded some Solr clusters from version 9.7 to 9.10.1. The collections have multiple shards and run distributed requests continuously. After a few days, distributed requests would start timing out and all clients would fail, requiring a full Solr cluster restart to recover. There was no sign of overload. Downgrading back to Solr 9.7 fixed the issues. This has been observed in several different environments.
Has anyone else seen similar behavior in their own clusters? As there are no errors in the Solr logs, no exceptions, no high load and no scary Grafana graphs for GC or otherwise, we have spent several days investigating and trying to reproduce, with limited luck. The best I have is an LLM analysis of the issue and a theory of what might cause it. I think the analysis is interesting, and the suspect is leaking semaphores in Http2SolrClient.AsyncTracker, which would eventually cause a full stop. The analysis is here https://cwiki.apache.org/confluence/x/AZM8G - it contains a description, executive summary, tech details and some questions for committers. You may comment inline in Confluence if you have an account, or here in this thread. I have not yet filed a bug in JIRA, as I want to discuss it here first and still hope to reproduce the issue in a pristine environment.

Jan
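
P.S. To make the suspected failure mode concrete, here is a toy Java sketch of how a permit leak in a semaphore-based in-flight tracker could lead to a full stop with no errors and no load. This is not the actual Http2SolrClient.AsyncTracker code; the class and method names (PermitLeakSketch, ToyTracker, begin, complete, run) and the numbers are made up for illustration, and the "leak every Nth request" rule simply stands in for whatever rare completion path might skip the release.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy model of a permit-based in-flight request tracker. NOT Solr's actual code. */
public class PermitLeakSketch {

    /** Hypothetical tracker: one permit per in-flight request. */
    static final class ToyTracker {
        private final Semaphore permits;

        ToyTracker(int maxInFlight) {
            this.permits = new Semaphore(maxInFlight);
        }

        /** Called before sending; returns false if no permit frees up within the timeout. */
        boolean begin(long timeoutMillis) throws InterruptedException {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        /** Must run exactly once per successful begin(), on every completion path. */
        void complete() {
            permits.release();
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /**
     * Issues `requests` simulated requests. Every `leakEvery`-th request skips
     * complete(), modelling a completion path that never releases its permit
     * (leakEvery <= 0 means no leak). Returns how many requests got no permit.
     */
    static int run(ToyTracker tracker, int requests, int leakEvery) throws InterruptedException {
        int rejected = 0;
        for (int i = 1; i <= requests; i++) {
            if (!tracker.begin(10)) {
                rejected++; // the caller sees a timeout, nothing else is logged
                continue;
            }
            boolean leaked = leakEvery > 0 && i % leakEvery == 0;
            if (!leaked) {
                tracker.complete();
            }
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        ToyTracker tracker = new ToyTracker(4);
        int rejected = run(tracker, 100, 10);
        System.out.println("available permits: " + tracker.available());
        System.out.println("requests that timed out waiting: " + rejected);
    }
}
```

With 4 permits and one leak per 10 requests, the pool is empty after request 40 and the remaining 60 requests all time out, even though nothing is actually in flight. If the theory holds, that would match what we see: a slow drain over days, then every distributed request stalling until a restart resets the permits.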
