janhoy opened a new pull request, #4236:
URL: https://github.com/apache/solr/pull/4236

   Relates to https://issues.apache.org/jira/browse/SOLR-18174
   
   ## Background
   
   `HttpJettySolrClient` uses an internal `AsyncTracker` semaphore (default 
1000 permits) to limit outstanding async HTTP requests. A permit is acquired in 
`onRequestQueued` and released in `onComplete`. Two separate bugs can cause 
permits to be consumed without being returned, eventually exhausting the 
semaphore. Once it reaches zero, every new distributed query blocks permanently 
on `semaphore.acquire()` — there is no timeout and no recovery short of a node 
restart.
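   The acquire/release pairing can be modeled as follows. This is a minimal sketch of the pattern, not the actual `AsyncTracker` code; the class name is illustrative, while the 1000-permit default and the listener names come from the description above:

   ```java
   import java.util.concurrent.Semaphore;

   // Minimal model of the AsyncTracker permit accounting described above.
   // The real tracker lives inside HttpJettySolrClient; this sketch only
   // illustrates how an unmatched acquire leaks a permit.
   class AsyncTrackerModel {
       static final int MAX_PERMITS = 1000; // default ceiling

       private final Semaphore semaphore = new Semaphore(MAX_PERMITS);

       // Fired when a request is queued; blocks once permits hit zero.
       void onRequestQueued() {
           semaphore.acquireUninterruptibly();
       }

       // Fired when the request completes. If this is ever skipped, or if
       // onRequestQueued fired twice for one request, a permit is lost.
       void onComplete() {
           semaphore.release();
       }

       int availablePermits() {
           return semaphore.availablePermits();
       }
   }
   ```

   Each unmatched `onRequestQueued` permanently lowers the effective ceiling by one, which is exactly the slow drain hypothesized below.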
   
   ## Two-phase failure model
   
   This PR aims to prove both phases through dedicated reproduction tests. So 
far we have only succeeded in proving the second phase, by simulating a low 
number of available permits. Phase 1 in the diagram below is what we infer 
must be happening in production, based on externally observed cluster 
behavior and clues in thread dumps.
   
   <img width="1004" height="460" alt="AsyncTracker-deadlock" 
src="https://github.com/user-attachments/assets/9ad07ead-16ac-4038-8987-9d9300e826bc" />
   
   ### Phase 1 — Gradual permit depletion (unproven hypothesis)
   
   **Test:** `testPermitLeakOnHttp2GoAwayDoubleQueuedListener`
   
   The thread dumps captured during the production deadlock show only 13 
threads blocked on `semaphore.acquire()` — far fewer than the 1000 needed to 
exhaust the semaphore through Phase 2 alone. This means the semaphore must 
already have been near zero *before* the final burst event, implying a slow 
background depletion process had been running for some time prior. We have no 
direct observation of this drain — no metrics were in place to track available 
permits over time — so this is an inference, not a measured fact.
   
   We have no proof of what causes this slow depletion. Several hypotheses are 
on the table:
   
   - **Jetty HTTP/2 GOAWAY race:** when a request is dispatched to a connection 
already marked closed, `HttpConnectionOverHTTP2.send()` may return 
`SendFailure(ClosedChannelException, retry=true)`, causing 
`HttpDestination.process()` to re-enqueue the same exchange and fire 
`notifyQueued()` a second time. This would call `semaphore.acquire()` twice for 
a single logical request while `onComplete` fires only once — leaking one 
permit per event. This is the hypothesis the test targets. However, since we 
run with `-Dsolr.http1=true`, this scenario should not apply to our cluster.
   - **Idle connection expiry:** two of the 13 stuck threads in the 
reproduction dump show `HttpConnectionOverHTTP.onIdleExpired()` on the path 
leading to `LBAsyncSolrClient.doAsyncRequest()`. It is possible that idle 
connection eviction events trigger a code path that acquires a semaphore permit 
without a guaranteed corresponding release. Attempts to reproduce this in 
isolation have been inconclusive.
   - **Unknown cause:** there may be other conditions in the Jetty client or 
Solr's use of it that produce unmatched acquires, particularly under the 
connection churn typical of a Kubernetes environment.
   
   The test simulates the double `onRequestQueued` notification directly via 
reflection and asserts that the permit count is unchanged after a single 
`onComplete`. It **fails** without a `PERMIT_ACQUIRED_ATTR` idempotency guard — 
demonstrating that this specific pattern *would* leak permits if it occurs in 
practice. However, we have not confirmed that Jetty actually fires 
`notifyQueued` twice under real conditions, either in our cluster or in unit 
tests that try to provoke it with a mock HTTP server. The guard and its TODO 
are included here to accompany the Phase 2 fix and to invite review of whether 
this path is reachable.
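   An idempotency guard along these lines is what the test exercises. The sketch below is illustrative rather than the PR's actual implementation: the attribute key value and the plain `Map` stand in for Jetty's per-request attributes, and the 40-permit ceiling mirrors the test setup:

   ```java
   import java.util.Map;
   import java.util.concurrent.Semaphore;

   // Sketch of the PERMIT_ACQUIRED_ATTR idempotency guard: acquire at most
   // one permit per logical request even if onRequestQueued is notified
   // twice for the same exchange.
   class GuardedTracker {
       // Hypothetical attribute key; stands in for the real constant.
       static final String PERMIT_ACQUIRED_ATTR = "permitAcquired";

       private final Semaphore semaphore = new Semaphore(40);

       // 'attributes' stands in for the Jetty Request attribute map.
       void onRequestQueued(Map<String, Object> attributes) {
           // putIfAbsent is atomic: only the first notification acquires.
           if (attributes.putIfAbsent(PERMIT_ACQUIRED_ATTR, Boolean.TRUE) == null) {
               semaphore.acquireUninterruptibly();
           }
       }

       void onComplete(Map<String, Object> attributes) {
           // Release only if this request actually holds a permit.
           if (attributes.remove(PERMIT_ACQUIRED_ATTR) != null) {
               semaphore.release();
           }
       }

       int availablePermits() {
           return semaphore.availablePermits();
       }
   }
   ```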
   
   Regardless of the true cause, the implication is the same: something slowly 
consumes permits in the background until the semaphore is nearly exhausted, at 
which point Phase 2 delivers the final blow.
   
   ### Phase 2 — Sudden full exhaustion (confirmed: re-entrant retry on the IO 
selector thread)
   
   **Test:** `testSemaphoreLeakOnLBRetry`
   
   When a connection-level failure occurs, Jetty fires `onFailure` on the IO 
selector thread. `HttpJettySolrClient.requestAsync` completes its 
`CompletableFuture` exceptionally from within that callback — still on the IO 
thread. `LBAsyncSolrClient.doAsyncRequest` has registered a `whenComplete` on 
that future; because the future completes on the IO thread, `whenComplete` also 
fires **synchronously on the same IO thread**. The `whenComplete` action calls
`doAsyncRequest` again as a retry to the next endpoint, which calls 
`semaphore.acquire()` — still on the IO thread — **before** the original 
request's `completeListener` has had a chance to call `semaphore.release()`.
   
   If the semaphore is already at zero (from Phase 1 depletion), `acquire()` 
**blocks the IO thread**. The blocked IO thread can no longer execute the 
`completeListener` that would release the original permit. The permit is 
permanently stuck, and the IO thread is permanently blocked. With N concurrent 
failures, N permits are lost in a single burst and N IO threads deadlock 
simultaneously.
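   The behavior underlying this is documented `CompletableFuture` semantics: a `whenComplete` callback registered before completion runs synchronously on whichever thread completes the future. A minimal stand-alone demonstration (the thread name is a stand-in for Jetty's IO selector thread):

   ```java
   import java.util.concurrent.CompletableFuture;

   // Demonstrates the re-entrancy behind Phase 2: a whenComplete callback
   // registered before completion runs synchronously on the thread that
   // completes the future.
   public class WhenCompleteThreadDemo {

       // Returns the name of the thread the whenComplete callback ran on.
       static String callbackThreadName() {
           CompletableFuture<Void> future = new CompletableFuture<>();
           String[] callbackThread = new String[1];
           future.whenComplete((result, failure) ->
               callbackThread[0] = Thread.currentThread().getName());

           // Complete exceptionally from a "fake IO thread", as Jetty's
           // onFailure would.
           Thread ioThread = new Thread(
               () -> future.completeExceptionally(new RuntimeException("connection failed")),
               "fake-io-thread");
           ioThread.start();
           try {
               ioThread.join();
           } catch (InterruptedException e) {
               throw new RuntimeException(e);
           }
           return callbackThread[0];
       }

       public static void main(String[] args) {
           // The LB retry in whenComplete therefore runs on the IO thread,
           // where a blocking semaphore.acquire() would deadlock it.
           System.out.println(callbackThreadName()); // prints "fake-io-thread"
       }
   }
   ```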
   
   The test reproduces this with `MAX_PERMITS=40`: a fake TCP server accepts 40 
connections (exhausting the semaphore), then RST-closes them all 
simultaneously. After 2 seconds, the test asserts that all 40 permits have been 
restored and all futures have completed. It **fails** with the current code. A 
companion test (`testNoPermitLeakOnLBRetryWithDefaultPermits`) confirms that 
with permits well above zero the IO thread never blocks and all permits are 
correctly returned.
   
   **Fix:** Complete the future off the IO thread by dispatching 
`completeExceptionally` to an executor, giving `onComplete` the opportunity to 
release the original permit before any `whenComplete` retry acquires a new one.
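   A hedged sketch of that dispatch, not the PR's exact code (the class and method names are illustrative, and the executor choice is an assumption):

   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;

   // Sketch of the fix: hand the exceptional completion off to an executor
   // so whenComplete callbacks -- including the LB retry -- run off the IO
   // selector thread, letting the IO thread reach the completeListener that
   // releases the original permit.
   class OffIoThreadCompletion {
       private final ExecutorService executor = Executors.newSingleThreadExecutor();

       // Called from Jetty's onFailure on the IO thread; returns immediately.
       void failOffIoThread(CompletableFuture<?> future, Throwable failure) {
           executor.execute(() -> future.completeExceptionally(failure));
       }

       void shutdown() {
           executor.shutdown();
       }
   }
   ```

   With this dispatch in place, `acquire()` in the retry can still block briefly, but it blocks an executor thread rather than the IO thread, so the pending `semaphore.release()` is no longer starved.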
   
   ## Also included
   
   - `ASYNC_REQUESTS_MAX_SYSPROP` (`solr.http.client.async_requests.max`) makes 
the permit ceiling configurable. The tests use it, and having it tunable in 
production seems harmless as well (feedback welcome).
   - An OTel gauge (`solr.http.client.async_requests`) exposes `state=max` and 
`state=available` permit counts via `HttpShardHandlerFactory`, enabling 
monitoring of available permits over time. Had this metric existed already, it 
would have told us a great deal about when and why the failure happens.
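   How the configurable ceiling might be wired, as a sketch: the sysprop name comes from this PR, while the class name, default, and plumbing are illustrative.

   ```java
   // Illustrative wiring for the configurable permit ceiling.
   public class AsyncRequestLimits {
       public static final String ASYNC_REQUESTS_MAX_SYSPROP =
           "solr.http.client.async_requests.max";
       static final int DEFAULT_MAX_ASYNC_REQUESTS = 1000;

       // Integer.getInteger falls back to the default when the property is
       // unset or not a valid integer.
       static int maxAsyncRequests() {
           return Integer.getInteger(ASYNC_REQUESTS_MAX_SYSPROP, DEFAULT_MAX_ASYNC_REQUESTS);
       }
   }
   ```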
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

