janhoy opened a new pull request, #4236: URL: https://github.com/apache/solr/pull/4236
Relates to https://issues.apache.org/jira/browse/SOLR-18174

## Background

`HttpJettySolrClient` uses an internal `AsyncTracker` semaphore (default 1000 permits) to limit outstanding async HTTP requests. A permit is acquired in `onRequestQueued` and released in `onComplete`. Two separate bugs can cause permits to be consumed without being returned, eventually exhausting the semaphore. Once it reaches zero, every new distributed query blocks permanently on `semaphore.acquire()` — there is no timeout and no recovery short of a node restart.

## Two-phase failure model

This PR aims to prove both phases through dedicated reproduction tests. So far we have only succeeded in proving the second phase, by simulating low available permits. What is shown as Phase 1 in the diagram below is what we assume must be happening in real life, based on externally observed cluster behavior and clues in thread dumps.

<img width="1004" height="460" alt="AsyncTracker-deadlock" src="https://github.com/user-attachments/assets/9ad07ead-16ac-4038-8987-9d9300e826bc" />

### Phase 1 — Gradual permit depletion (unproven hypothesis)

**Test:** `testPermitLeakOnHttp2GoAwayDoubleQueuedListener`

The thread dumps captured during the production deadlock show only 13 threads blocked on `semaphore.acquire()` — far fewer than the 1000 needed to exhaust the semaphore through Phase 2 alone. This means the semaphore must already have been near zero *before* the final burst event, implying that a slow background depletion process had been running for some time. We have no direct observation of this drain — no metrics were in place to track available permits over time — so this is an inference, not a measured fact, and we have no proof of what causes the slow depletion.
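To make the invariant at stake concrete, here is a toy model (not Solr's actual `AsyncTracker` class) of the acquire/release pattern described above, showing how a request that is queued twice but completed once leaves the counter permanently one permit short:

```java
import java.util.concurrent.Semaphore;

// Toy model of the AsyncTracker pattern: a semaphore caps outstanding async
// requests; a permit is taken when a request is queued and returned when it
// completes. Any code path that queues without a matching completion leaks a
// permit, and enough leaks make acquire() block forever.
public class AsyncTrackerSketch {
    private final Semaphore semaphore;

    public AsyncTrackerSketch(int maxPermits) {
        this.semaphore = new Semaphore(maxPermits);
    }

    // Mirrors onRequestQueued: blocks when no permits remain.
    public void onRequestQueued() throws InterruptedException {
        semaphore.acquire();
    }

    // Mirrors onComplete: must run exactly once per queued request.
    public void onComplete() {
        semaphore.release();
    }

    public int available() {
        return semaphore.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncTrackerSketch tracker = new AsyncTrackerSketch(2);
        tracker.onRequestQueued();
        System.out.println("after queue: " + tracker.available());    // one permit out
        tracker.onComplete();
        System.out.println("after complete: " + tracker.available()); // balanced again
        // The leak: one logical request queued twice, completed once.
        tracker.onRequestQueued();
        tracker.onRequestQueued();
        tracker.onComplete();
        System.out.println("after leak: " + tracker.available());     // permanently short by one
    }
}
```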
Several hypotheses are on the table:

- **Jetty HTTP/2 GOAWAY race:** when a request is dispatched to a connection already marked closed, `HttpConnectionOverHTTP2.send()` may return `SendFailure(ClosedChannelException, retry=true)`, causing `HttpDestination.process()` to re-enqueue the same exchange and fire `notifyQueued()` a second time. This would call `semaphore.acquire()` twice for a single logical request while `onComplete` fires only once, leaking one permit per event. This is the hypothesis the test targets. However, since we run with `-Dsolr.http1=true`, this finding should not be relevant to us.
- **Idle connection expiry:** two of the 13 stuck threads in the reproduction dump show `HttpConnectionOverHTTP.onIdleExpired()` on the path leading to `LBAsyncSolrClient.doAsyncRequest()`. It is possible that idle connection eviction triggers a code path that acquires a semaphore permit without a guaranteed corresponding release. Attempts to reproduce this in isolation have been inconclusive.
- **Unknown cause:** there may be other conditions in the Jetty client, or in Solr's use of it, that produce unmatched acquires, particularly under the connection churn typical of a Kubernetes environment.

The test simulates the double `onRequestQueued` notification directly via reflection and asserts that the permit count is unchanged after a single `onComplete`. It **fails** without a `PERMIT_ACQUIRED_ATTR` idempotency guard, demonstrating that this specific pattern *would* leak permits if it occurs in practice. However, we have not confirmed that Jetty actually fires `notifyQueued` twice, either under real conditions in our cluster or in unit tests that try to provoke it with a mock HTTP server. The guard and its TODO are included here to accompany the Phase 2 fix and to invite review of whether this path is reachable.
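The shape of the idempotency guard can be sketched as follows. This is a standalone illustration, not the patch itself: it uses a plain map in place of Jetty's per-request attribute storage, and the attribute name is illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of a PERMIT_ACQUIRED_ATTR idempotency guard: mark the request when its
// permit is first taken, so a duplicate onRequestQueued notification for the
// same exchange becomes a no-op instead of a second acquire.
public class PermitGuardSketch {
    private static final String PERMIT_ACQUIRED_ATTR = "permitAcquired"; // name illustrative
    private final Semaphore semaphore = new Semaphore(10);

    public void onRequestQueued(Map<String, Object> requestAttrs) throws InterruptedException {
        if (requestAttrs.putIfAbsent(PERMIT_ACQUIRED_ATTR, Boolean.TRUE) != null) {
            return; // already counted: we were notified twice for this exchange
        }
        semaphore.acquire();
    }

    public void onComplete(Map<String, Object> requestAttrs) {
        if (requestAttrs.remove(PERMIT_ACQUIRED_ATTR) != null) {
            semaphore.release(); // release only if we actually hold a permit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PermitGuardSketch guard = new PermitGuardSketch();
        Map<String, Object> attrs = new ConcurrentHashMap<>();
        guard.onRequestQueued(attrs);
        guard.onRequestQueued(attrs); // duplicate notification, ignored by the guard
        guard.onComplete(attrs);      // single completion restores the one permit taken
        System.out.println("available after double-queue: " + guard.semaphore.availablePermits());
    }
}
```

Without the `putIfAbsent` check, the same sequence would end with one permit fewer than the maximum, which is exactly the leak the Phase 1 test demonstrates.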
Regardless of the true cause, the implication is the same: something slowly consumes permits in the background until the semaphore is nearly exhausted, at which point Phase 2 delivers the final blow.

### Phase 2 — Sudden full exhaustion (confirmed: re-entrant retry on the IO selector thread)

**Test:** `testSemaphoreLeakOnLBRetry`

When a connection-level failure occurs, Jetty fires `onFailure` on the IO selector thread. `HttpJettySolrClient.requestAsync` completes its `CompletableFuture` exceptionally from within that callback — still on the IO thread. `LBAsyncSolrClient.doAsyncRequest` registered a `whenComplete` on that future; because the future completes on the IO thread, `whenComplete` also fires **synchronously on the same IO thread**. The `whenComplete` action calls `doAsyncRequest` again as a retry to the next endpoint, which calls `semaphore.acquire()` — still on the IO thread — **before** the original request's `completeListener` has had a chance to call `semaphore.release()`.

If the semaphore is already at zero (from Phase 1 depletion), `acquire()` **blocks the IO thread**. The blocked IO thread can no longer execute the `completeListener` that would release the original permit. The permit is permanently stuck, and the IO thread is permanently blocked. With N concurrent failures, N permits are lost in a single burst and N IO threads deadlock simultaneously.

The test reproduces this with `MAX_PERMITS=40`: a fake TCP server accepts 40 connections (exhausting the semaphore), then RST-closes them all simultaneously. After 2 seconds, the test asserts that all 40 permits have been restored and all futures have completed. It **fails** with the current code. A companion test (`testNoPermitLeakOnLBRetryWithDefaultPermits`) confirms that with permits well above zero the IO thread never blocks and all permits are correctly returned.
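The Phase 2 mechanism can be reduced to a toy reproduction (this is not the Solr test, just the ordering problem in isolation): a single "IO thread" completes a future exceptionally, the `whenComplete` retry runs synchronously on that same thread, and its acquire happens before the release for the failed request can run. A `tryAcquire` with a timeout stands in for the real blocking `acquire` so the demo terminates instead of deadlocking.

```java
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy reproduction of the re-entrant retry: the retry's acquire runs on the
// completing (IO) thread, ahead of the release that would free the permit.
public class ReentrantRetryDemo {
    public static void main(String[] args) throws Exception {
        Semaphore permits = new Semaphore(1);
        permits.acquire(); // the in-flight request holds the last permit

        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        CompletableFuture<Void> future = new CompletableFuture<>();

        future.whenComplete((v, t) -> {
            // Retry path: fires synchronously on the thread completing the future.
            try {
                // Stands in for the real blocking acquire(); it cannot succeed,
                // because the matching release() is queued behind this callback
                // on the very same thread.
                boolean got = permits.tryAcquire(500, TimeUnit.MILLISECONDS);
                System.out.println("retry acquired permit: " + got);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        ioThread.submit(() -> {
            // The whenComplete callback above runs inside this call, on this thread.
            future.completeExceptionally(new ClosedChannelException());
            // Only reached after the retry gave up: too late with a real acquire().
            permits.release();
        }).get();

        System.out.println("permits after burst: " + permits.availablePermits());
        ioThread.shutdown();
    }
}
```

With a real blocking `acquire()` instead of the timed `tryAcquire`, the IO thread would never reach the `release()` line: that is the deadlock.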
**Fix:** Complete the future off the IO thread by dispatching `completeExceptionally` to an executor, giving `onComplete` the opportunity to release the original permit before any `whenComplete` retry acquires a new one.

## Also included

- `ASYNC_REQUESTS_MAX_SYSPROP` (`solr.http.client.async_requests.max`) makes the permit ceiling configurable. It is used by the tests, and having it configurable in production seems harmless and possibly useful.
- An OTel gauge (`solr.http.client.async_requests`) exposes `state=max` and `state=available` permit counts via `HttpShardHandlerFactory`, enabling monitoring of available permits. Had this metric existed earlier, it would have told us a great deal about when and why the failure happens.
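For reviewers, the dispatch-to-executor fix can be sketched in the same toy model as the deadlock (names and structure are illustrative, not the actual patch): the exceptional completion is handed to a separate executor, so the IO thread runs its release before, or concurrently with, the retry's acquire.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch of the fix: completeExceptionally is dispatched off the IO thread, so
// the whenComplete retry no longer runs on (and blocks) the IO thread, and the
// original permit is released in time for the retry to acquire it.
public class OffIoThreadCompletion {
    public static void main(String[] args) throws Exception {
        Semaphore permits = new Semaphore(1);
        permits.acquire(); // the failing request holds the last permit

        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        ExecutorService completionExecutor = Executors.newSingleThreadExecutor();
        CompletableFuture<Void> future = new CompletableFuture<>();
        CountDownLatch retried = new CountDownLatch(1);

        future.whenComplete((v, t) -> {
            try {
                // Retry path: now runs on completionExecutor, not the IO thread.
                // The IO thread's release below is free to proceed, so this succeeds.
                boolean got = permits.tryAcquire(2, TimeUnit.SECONDS);
                System.out.println("retry acquired permit: " + got);
                if (got) permits.release();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                retried.countDown();
            }
        });

        ioThread.submit(() -> {
            // The fix: complete the future on a different thread...
            completionExecutor.execute(() ->
                future.completeExceptionally(new IOException("connection reset")));
            // ...so this release is not stuck behind a blocked retry.
            permits.release();
        });

        retried.await();
        ioThread.shutdown();
        completionExecutor.shutdown();
    }
}
```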
