[ 
https://issues.apache.org/jira/browse/SOLR-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068153#comment-18068153
 ] 

David Smiley commented on SOLR-18174:
-------------------------------------

This looks exactly like what SOLR-18051 solves!  At the very least, SOLR-18051 
should double the permits available if it doesn't inherently fix the bug at 
stake here (it may not).

> AsyncTracker Semaphore leak on LBAsyncSolrClient retries
> --------------------------------------------------------
>
>                 Key: SOLR-18174
>                 URL: https://issues.apache.org/jira/browse/SOLR-18174
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrJ
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: threads-test-node-0.json, threads-test-node-1.json
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We experienced completely deadlocked distributed requests on Solr 9.10.1 
> several times in production, once every couple of days. A Solr restart 
> resolved the issue. This started happening immediately after upgrading from 
> Solr 9.7 to 9.10.
> I had Claude analyze what could be happening; see 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977]. 
> The analysis identifies several code changes related to distributed search 
> between those versions, involving SOLR-17819, SOLR-17792, and SOLR-17776. 
> These changed the behavior of cancelAll and request.abort for aborted or 
> failed queries, which could lead to a semaphore leak, at least temporarily 
> (for 10 min). While we could reproduce such a scenario, it would only be a 
> temporary leak, as permits would be released after the timeout.
> Later we were able to catch an internal test environment in the failure 
> state, and were able to take thread dumps for the two nodes in the cluster 
> (attached). Analyzing these with Claude identified another failure mode: 
> LBHttp2SolrClient has retry logic for when the first request fails, and it 
> will spawn a new async request which obtains another Semaphore permit without 
> first releasing the permit obtained for the original query. Net result: if 
> the number of available permits is already low, a permanent deadlock happens. 
> I will attach a PR reproducing this failure state, but it simulates a low 
> number of permits as a prerequisite.
> So the final piece of the puzzle is to demonstrate how Semaphore permits may 
> gradually leak over time to get to a state of low availability, which is a 
> prerequisite for the deadlock case described above. This is still TBD.
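
The retry failure mode described in the quoted issue can be illustrated with a 
minimal, standalone sketch using a plain java.util.concurrent.Semaphore. This is 
not the LBHttp2SolrClient or AsyncTracker code: the class and method names below 
are hypothetical, and the retry uses a timed tryAcquire so the demo terminates, 
whereas an untimed acquire would block forever in the same situation.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from SolrJ.
class PermitLeakSketch {

    /**
     * Retry that asks for a second permit while still holding the first.
     * Returns true if the retry obtained a permit within waitMillis.
     */
    static boolean retryHoldingPermit(Semaphore permits, long waitMillis)
            throws InterruptedException {
        permits.acquire(); // permit for the original request
        try {
            // Original request fails; the retry wants its own permit.
            boolean got = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
            if (got) {
                permits.release(); // retry finished
            }
            return got;
        } finally {
            permits.release(); // original permit released only at the very end
        }
    }

    /**
     * Retry that hands the original permit back before asking for a new one.
     */
    static boolean retryAfterRelease(Semaphore permits, long waitMillis)
            throws InterruptedException {
        permits.acquire();  // permit for the original request
        permits.release();  // original request failed; give the permit back
        boolean got = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        if (got) {
            permits.release(); // retry finished
        }
        return got;
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore permits = new Semaphore(1); // simulate low permit availability
        System.out.println("retry while holding permit: "
                + retryHoldingPermit(permits, 100));   // prints false
        System.out.println("retry after releasing permit: "
                + retryAfterRelease(permits, 100));    // prints true
    }
}
```

With a single permit available, the first variant can never obtain the second 
permit because the only one is held by the same logical request. That is the 
low-availability precondition the issue description mentions.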



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

