[
https://issues.apache.org/jira/browse/SOLR-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068221#comment-18068221
]
Jan Høydahl commented on SOLR-18174:
------------------------------------
Interesting, David. Yeah, that bug effectively reduces the 1000 permits to 500,
which is not a root cause but does accelerate the deadlock.
> AsyncTracker Semaphore leak on LBAsyncSolrClient retries
> --------------------------------------------------------
>
> Key: SOLR-18174
> URL: https://issues.apache.org/jira/browse/SOLR-18174
> Project: Solr
> Issue Type: Bug
> Components: SolrJ
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Major
> Labels: pull-request-available
> Attachments: threads-test-node-0.json, threads-test-node-1.json
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Experienced completely deadlocked Solr 9.10.1 distributed requests several
> times in production, once every couple of days. A Solr restart resolved the
> issue. This started happening immediately after upgrading from Solr 9.7 to
> 9.10.
> I had Claude make an analysis of what could be happening, see
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622977]
> . This identifies several code changes related to distributed search between
> those versions and involves jiras SOLR-17819, SOLR-17792, SOLR-17776 related
> to changed behavior with cancelAll and request.abort during aborted or failed
> queries, which could lead to a semaphore leak, at least temporarily for 10
> min. While we could reproduce such a scenario, it would only be a temporary
> leak as permits would be released after timeout.
> Later we were able to catch an internal test environment in the failure
> state, and were able to make thread dumps for the two nodes in the cluster
> (attached). Analyzing these with Claude identified another failure mode:
> LBHttp2SolrClient has retry logic for when the first request fails: it
> spawns a new async request which obtains another Semaphore permit, without
> first releasing the permit obtained for the original query. Net result: if
> the number of available permits is already low, a permanent deadlock happens.
> I will attach a PR reproducing this failure state, but it simulates a low
> number of permits as a prerequisite.
> So the final piece of the puzzle is to demonstrate how Semaphore permits may
> gradually leak over time to get to a state of low availability, which is a
> prerequisite for the deadlock case described above. This is still TBD.
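
The retry failure mode described above can be illustrated with a plain
java.util.concurrent.Semaphore. This is only a minimal sketch under stated
assumptions, not the actual SolrJ code: the class and method names are
hypothetical, the permit count of 1 stands in for "permits already low", and a
timed tryAcquire replaces the blocking acquire so the example reports the
starvation instead of hanging.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in SolrJ.
public class PermitLeakSketch {

    // Suspected pattern: the retry asks for a new permit while the
    // original request still holds its own permit.
    static boolean retryWhileHoldingPermit(Semaphore permits, long waitMillis) {
        permits.acquireUninterruptibly();          // permit for the original request
        try {
            // The original request "fails" here and a retry is spawned.
            return tryRetry(permits, waitMillis);
        } finally {
            permits.release();                     // original permit released too late
        }
    }

    // Alternative ordering: give the original permit back before retrying.
    static boolean retryAfterReleasingPermit(Semaphore permits, long waitMillis) {
        permits.acquireUninterruptibly();          // permit for the original request
        permits.release();                         // released before the retry starts
        return tryRetry(permits, waitMillis);
    }

    // Returns true if the retry could obtain a permit within the wait time.
    private static boolean tryRetry(Semaphore permits, long waitMillis) {
        try {
            boolean acquired = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                permits.release();                 // retry finished, return its permit
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Semaphore lowPermits = new Semaphore(1);   // assumption: only one permit left
        System.out.println("retry while holding permit: "
                + retryWhileHoldingPermit(lowPermits, 100));
        System.out.println("retry after releasing permit: "
                + retryAfterReleasingPermit(lowPermits, 100));
    }
}
```

With a single permit available, the first call cannot obtain a permit for its
retry because the original request still holds it, while the second call
succeeds. With an untimed acquire in place of tryAcquire, the first case would
wait forever, which matches the permanent deadlock described in the issue.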