[ 
https://issues.apache.org/jira/browse/SOLR-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-18174:
-------------------------------
    Attachment: threads-test-node-1.json
                threads-test-node-0.json

> AsyncTracker Semaphore leak on LBAsyncSolrClient retries
> --------------------------------------------------------
>
>                 Key: SOLR-18174
>                 URL: https://issues.apache.org/jira/browse/SOLR-18174
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrJ
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Major
>         Attachments: threads-test-node-0.json, threads-test-node-1.json
>
>
> We experienced completely deadlocked distributed requests on Solr 9.10.1
> several times in production, about once every couple of days. A Solr restart
> resolved the issue. This started happening immediately after upgrading from
> Solr 9.7 to 9.10.
> I had Claude analyze what could be happening; see
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977].
> The analysis identifies several code changes related to distributed search
> between those versions. They involve SOLR-17819, SOLR-17792 and SOLR-17776,
> which relate to changed behavior of cancelAll and request.abort during
> aborted or failed queries, and which could lead to a semaphore leak, at
> least temporarily for 10 minutes.
> Later we caught an internal test environment in the failure state and were
> able to take thread dumps from the two nodes in the cluster (attached).
> Analyzing these with Claude identified another failure mode:
> LBHttp2SolrClient has retry logic for when the first request fails. The retry
> spawns a new request that obtains another Semaphore permit without first
> releasing the permit obtained for the original query. The net result is that
> the original permit is leaked. A description of this failure scenario will be
> presented in a Pull Request that also shows a reproduction and a fix.
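
The retry failure mode described above can be illustrated with a minimal sketch using a plain java.util.concurrent.Semaphore. This is not the actual SolrJ code: the class PermitLeakSketch and the methods leakyRetry and fixedRetry are hypothetical names made up for illustration, and the real AsyncTracker and LBHttp2SolrClient logic may differ in its details.

```java
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; none of these names exist in SolrJ.
class PermitLeakSketch {

    // Leaky pattern: the retry acquires a second permit while the permit of
    // the failed first attempt is still held, and only one release happens.
    public static int leakyRetry(Semaphore permits) {
        permits.acquireUninterruptibly(); // permit for the first attempt
        // ... first attempt fails, retry is issued without releasing ...
        permits.acquireUninterruptibly(); // permit for the retry
        // ... retry completes, its completion handler releases one permit ...
        permits.release();
        return permits.availablePermits(); // one permit is now lost
    }

    // Assumed fix: release the failed attempt's permit before retrying.
    public static int fixedRetry(Semaphore permits) {
        permits.acquireUninterruptibly(); // permit for the first attempt
        permits.release();                // first attempt failed: give it back
        permits.acquireUninterruptibly(); // permit for the retry
        permits.release();                // retry completed
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakyRetry(new Semaphore(4)) + " of 4 permits left");
        System.out.println("fixed: " + fixedRetry(new Semaphore(4)) + " of 4 permits left");
    }
}
```

In this sketch every leaky retry permanently removes one permit from the pool. Once the pool is empty, the next acquire blocks indefinitely, which would be consistent with the deadlocked requests described above.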



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
