[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

Hoss Man (JIRA) Mon, 11 Jul 2016 15:17:23 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371758#comment-15371758
 ]


Hoss Man commented on SOLR-9290:
--------------------------------

Questions specifically for [~shaie], following up on comments made in the mailing 
list thread mentioned in the issue summary...

{quote}
When it does happen, the number of CLOSE_WAITS climb very high, to the order of 
30K+ entries in 'netstat'.
...
When I say it does not reproduce on 5.4.1 I really mean the numbers
don't go as high as they do in 5.5.1. Meaning, when running without
SSL, the number of CLOSE_WAITs is smallish, usually less than a 10 (I
would separately like to understand why we have any in that state at
all). When running with SSL and 5.4.1, they stay low at the order of
hundreds the most.
{quote}

* Does this only reproduce in your application, with your customized configs of 
Solr, or can you reproduce it using something trivial like "modify 
bin/solr.in.sh to point at an SSL cert, then run {{bin/solr -noprompt -cloud}}"?
* Does the problem manifest solely with indexing, or with queries as well? 
i.e.:
** assuming a pre-built collection, and then all nodes restarted, does 
hammering the cluster with read only queries manifest the problem?
** assuming a virgin cluster with no docs, does hammering the cluster w/updates 
but never any queries, manifest the problem?
* Assuming you start by bringing up a virgin cluster and then begin hammering 
it with whatever sequences of requests are needed to manifest the problem, how 
long do you have to wait before the number of CLOSE_WAITS spikes high enough 
that you are reasonably confident the problem has occurred?

The last question is a pre-req to asking whether we can just {{git bisect}} to 
identify where/when the problem originated.  

Even if writing a (reliable) bash automation script (to start the cluster, 
_and_ trigger requests, _and_ monitor the CLOSE_WAITS to see if they go 
over a specified threshold in under a specified time limit, _and_ shut 
everything down cleanly) is too cumbersome to have faith in running an 
automated {{git bisect run test.sh}}, we could still consider doing some 
manually driven git bisection to try and track this down, as long as each 
iteration doesn't take very long.
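
As a rough sketch of what the pass/fail core of such a script might look like (in Python rather than bash, purely for illustration): {{git bisect run}} treats exit code 0 as "good", 125 as "skip this commit", and any other code from 1 to 127 as "bad". Everything named below ({{bisect_verdict}}, the threshold, the polling interval) is hypothetical, and starting the cluster, driving requests and shutting down are deliberately left out.

```python
import time
from typing import Callable

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_verdict(
    count_close_wait: Callable[[], int],
    threshold: int,
    time_limit_s: float,
    poll_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll a CLOSE_WAIT counter until it exceeds `threshold` or time runs out.

    Returns BAD as soon as the count goes over the threshold, GOOD if the
    time limit passes without that happening, and SKIP if the counter itself
    cannot be read (so a broken environment is not blamed on the commit).
    """
    deadline = clock() + time_limit_s
    while True:
        try:
            n = count_close_wait()
        except OSError:
            return SKIP
        if n > threshold:
            return BAD
        if clock() >= deadline:
            return GOOD
        sleep(poll_s)
```

A wrapper would end with something like {{sys.exit(bisect_verdict(counter, threshold=1000, time_limit_s=600))}}, where the counter is whatever command reports CLOSE_WAITS on the test box; the numbers are placeholders, not measured values.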

Specifically: {{git merge-base}} says ffadf9715c4a511178183fc1411b18c1701b9f1d 
is the common ancestor for 5.4.1 and 5.5.1, and {{git log}} says there are 487 
commits between that point and the 5.5.1 tag.  Statistically speaking it should 
only take ~10 iterations to do a binary search of those commits to find the 
first problematic one.
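
For reference, the iteration estimate is just a base-2 logarithm of the commit count. The helper below is a hypothetical illustration, and it assumes a roughly linear history; with merges, {{git bisect}} may need a step or two more or less.

```python
import math


def bisect_steps(n_commits: int) -> int:
    """Worst-case number of good/bad tests needed to isolate the first bad
    commit among n_commits candidates, assuming a linear history."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0


print(bisect_steps(487))  # prints 9
```

Nine steps in the worst case is consistent with the "~10 iterations" figure above.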

Assuming there is a manual process someone can run on a clean git checkout of 
5.4.1 that takes under 10 minutes to get from "ant clean server" to an obvious 
spike in CLOSE_WAITS, someone with some CPU cycles to spare who doesn't mind a 
lot of context switching while they do their day job could be...
# running a command to spin up the cluster & client hammering code
# setting a 10 minute timer
# when the timer goes off, checking the results of another command to count the 
CLOSE_WAITS
# running {{git bisect good}} or {{git bisect bad}} accordingly
# repeating
...and within ~2-3 hours should almost certainly have tracked down when/where 
the problem started.
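
The "command to count the CLOSE_WAITS" could be as simple as the sketch below. It assumes a Linux box where {{netstat -tan}} prints one socket per line with the state as its own whitespace-separated column; both function names are made up for illustration.

```python
import subprocess


def count_close_wait(netstat_output: str) -> int:
    """Count lines of netstat output whose columns include CLOSE_WAIT."""
    return sum(
        1 for line in netstat_output.splitlines() if "CLOSE_WAIT" in line.split()
    )


def current_close_wait() -> int:
    """Run `netstat -tan` (assumed to be on PATH) and count CLOSE_WAIT sockets."""
    out = subprocess.run(
        ["netstat", "-tan"], capture_output=True, text=True, check=True
    ).stdout
    return count_close_wait(out)
```

Matching on a whole column rather than a substring avoids miscounting lines that merely contain the text elsewhere, and keeping the parser separate from the {{netstat}} call makes it easy to check against saved output.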



> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-9290
>                 URL: https://issues.apache.org/jira/browse/SOLR-9290
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 5.5.1, 5.5.2
>            Reporter: Anshum Gupta
>            Priority: Critical
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%[email protected]%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

