Deadlock observed for distributed search in Solr 9.10.1

Jan Høydahl Wed, 18 Mar 2026 04:39:30 -0700

We recently upgraded some Solr clusters from version 9.7 to 9.10.1. Collection 
have multiple shards and run distributed requests continously. After a few 
days, distributed requests would start timing out and all clients would fail, 
requiring a full solr cluster restart to recover. No sign of overload. 
Downgrading back to Solr 9.7 fixed the issues. This has been observed in 
several different environments.


Have anyone else seen similar behavior in your own clusters?

As there is no errors in Solr logs, no exceptions, no high load or scary 
Grafana graphs in GC or otherwise, we have spent several days investigating and 
trying to reproduce, with limited luck.

The best I have is an LLM analysis of the issue and a theory of what might 
cause it. It think the analysis is interesting and the suspect is leaking 
semaphores in Http2SolrClient.AsyncTracker which would eventually cause a full 
stop.

The analysis is here https://cwiki.apache.org/confluence/x/AZM8G - it contains 
a description, executive summary, tech details and some questions for 
ocmmitters. You may comment inline in Confluence if you have an account, or 
here in this thread.

I have not yet filed a bug in JIRA, as I want to discuss here and still hope to 
reproduce the issue in a pristine environment.

Jan

Deadlock observed for distributed search in Solr 9.10.1

Reply via email to