[
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371758#comment-15371758
]
Hoss Man commented on SOLR-9290:
--------------------------------
Questions specifically for [~shaie], following up on comments made in the
mailing list thread mentioned in the issue summary...
{quote}
When it does happen, the number of CLOSE_WAITS climb very high, to the order of
30K+ entries in 'netstat'.
...
When I say it does not reproduce on 5.4.1 I really mean the numbers
don't go as high as they do in 5.5.1. Meaning, when running without
SSL, the number of CLOSE_WAITs is smallish, usually less than a 10 (I
would separately like to understand why we have any in that state at
all). When running with SSL and 5.4.1, they stay low at the order of
hundreds the most.
{quote}
* Does this only reproduce in your application, with your customized configs of
Solr, or can you reproduce it using something trivial like "modify
bin/solr.in.sh to point at an SSL cert, then run {{bin/solr -noprompt
-cloud}}"?
* Does the problem manifest solely with indexing, or with queries as well?
i.e...
** assuming a pre-built collection, and then all nodes restarted, does
hammering the cluster with read-only queries manifest the problem?
** assuming a virgin cluster with no docs, does hammering the cluster with
updates, but never any queries, manifest the problem?
* Assuming you start by bringing up a virgin cluster and then begin hammering
it with whatever sequences of requests are needed to manifest the problem, how
long do you have to wait before the number of CLOSE_WAITS spikes high enough
that you are reasonably confident the problem has occurred?
The last question is a prerequisite for deciding whether we can just
{{git bisect}} to identify where/when the problem originated.
Even if writing a (reliable) bash automation script (to start the cluster,
_and_ trigger requests, _and_ monitor the CLOSE_WAITS to see if they go
over a specified threshold within a specified time limit, _and_ shut
everything down cleanly) is too cumbersome to have faith in running an
automated {{git bisect run test.sh}}, we could still consider doing some
manually driven git bisection to try and track this down, as long as each
iteration doesn't take very long.
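For what it's worth, the monitoring half of such a {{test.sh}} could be sketched roughly as below. This is only a sketch under assumptions: the function names, the use of {{netstat -tan}} as the sampling command, and whatever threshold and time limit get passed in are illustrative and not taken from the thread. Starting the cluster, driving the load, and shutting down are left out because they depend on the setup. The one thing it relies on from git itself is that {{git bisect run}} treats exit status 0 as "good", 1-127 (other than 125) as "bad", and 125 as "skip".

```shell
# Count sockets in CLOSE_WAIT from `netstat -tan`-style text on stdin.
# Assumes the state is the 6th column, as in typical netstat -tan output.
count_close_wait() {
  awk '$6 == "CLOSE_WAIT" { n++ } END { print n + 0 }'
}

# check_for_spike THRESHOLD TIMEOUT_SECS INTERVAL_SECS [SAMPLE_CMD...]
# Polls until the CLOSE_WAIT count exceeds THRESHOLD or TIMEOUT_SECS elapse.
# Returns 1 (a "bad" revision for git bisect run) on a spike, 0 otherwise.
# SAMPLE_CMD prints the socket table; it defaults to `netstat -tan`.
# Note: plain POSIX sh, so the variables below are not function-local.
check_for_spike() {
  threshold=$1
  timeout=$2
  interval=$3
  shift 3
  [ "$#" -gt 0 ] || set -- netstat -tan
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    n=$("$@" | count_close_wait)
    if [ "$n" -gt "$threshold" ]; then
      echo "CLOSE_WAIT count $n exceeds threshold $threshold" >&2
      return 1
    fi
    [ "$(date +%s)" -ge "$deadline" ] && break
    sleep "$interval"
  done
  return 0
}
```

A {{test.sh}} built on this would start the cluster and the client, call something like {{check_for_spike 1000 600 10}}, remember its status, shut everything down, and exit with that status. The threshold of 1000 is a guess sitting between the "hundreds" reported for 5.4.1 with SSL and the 30K+ reported for 5.5.1.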
Specifically: {{git merge-base}} says ffadf9715c4a511178183fc1411b18c1701b9f1d
is the common ancestor for 5.4.1 and 5.5.1, and {{git log}} says there are 487
commits between that point and the 5.5.1 tag. Statistically speaking it should
only take ~10 iterations to do a binary search of those commits to find the
first problematic one.
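As a sanity check on that estimate, the worst-case number of halvings can be computed directly. The helper below is a hypothetical illustration (the name {{bisect_steps}} is made up); it just counts how many times the candidate set has to be halved, rounding up, before one commit is left.

```shell
# bisect_steps N
# Prints how many halvings (rounding up) reduce N candidate commits to 1,
# i.e. ceil(log2(N)) for N >= 1.
bisect_steps() {
  n=$1
  steps=0
  while [ "$n" -gt 1 ]; do
    n=$(( (n + 1) / 2 ))
    steps=$(( steps + 1 ))
  done
  echo "$steps"
}

# Example inputs could come from git itself (tag name left as a placeholder):
#   git rev-list --count ffadf9715c4a511178183fc1411b18c1701b9f1d..<5.5.1-tag>
#   bisect_steps 487
```

For 487 commits this comes out at 9, in line with the ~10 iterations estimated above.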
Assuming there is a manual process someone can run on a clean git checkout of
5.4.1 that takes under 10 minutes to get from "ant clean server" to an obvious
spike in CLOSE_WAITS, someone with some CPU cycles to spare who doesn't mind a
lot of context switching while they do their day job could be...
# running a command to spin up the cluster & client hammering code
# setting a 10 minute timer
# when the timer goes off, check the results of another command to count the
CLOSE_WAITS
# {{git bisect good/bad}}
# repeat
...and within ~2-3 hours should almost certainly have tracked down when/where
the problem started.
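The "check the count, then mark the revision" part of that loop could be wrapped in a small helper, sketched here under assumptions: {{mark_revision}} and the {{GIT}} override are hypothetical names, and the threshold is whatever value cleanly separates the "hundreds" seen on 5.4.1 from the 30K+ seen on 5.5.1.

```shell
# mark_revision COUNT THRESHOLD
# Marks the currently checked-out revision for git bisect based on an
# observed CLOSE_WAIT count. Set GIT=echo for a dry run that only prints
# the git arguments instead of running git.
mark_revision() {
  count=$1
  threshold=$2
  if [ "$count" -gt "$threshold" ]; then
    "${GIT:-git}" bisect bad
  else
    "${GIT:-git}" bisect good
  fi
}

# A manual session might then look like (5.5.1 tag name left as a placeholder):
#   git bisect start
#   git bisect bad <5.5.1-tag>
#   git bisect good ffadf9715c4a511178183fc1411b18c1701b9f1d
#   ...build, start cluster + client, wait for the timer...
#   mark_revision "$(netstat -tan | grep -c CLOSE_WAIT)" 1000
#   ...repeat until git reports the first bad commit, then:
#   git bisect reset
```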
> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> ------------------------------------------------------------------------------
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 5.5.1, 5.5.2
> Reporter: Anshum Gupta
> Priority: Critical
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT
> state.
> At my workplace, we have seen this issue only with 5.5.1 and could not
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of
> users with 5.3.1 running into this issue too.
> Here's an excerpt from the email [~shaie] sent to the mailing list (about
> what we see):
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%[email protected]%3E
> Creating this issue so we could track this and have more people comment on
> what they see.