[
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373452#comment-15373452
]
Shalin Shekhar Mangar commented on SOLR-9290:
---------------------------------------------
This is very easy to reproduce on stock Solr with SSL enabled. My test setup
creates two SSL-enabled Solr instances with a 5 shard x 2 replica collection
and runs a short indexing program (just 9 update requests with 1 document each
and a commit at the end). If I keep running the indexing program repeatedly,
the number of connections in the CLOSE_WAIT state gradually increases.
Interestingly, the number of connections stuck in CLOSE_WAIT decreases during
indexing and increases again about 10 seconds after indexing stops.
I can reproduce the problem on 6.1, 6.0, 5.5.1 and 5.3.2. I am not able to
reproduce this on master, although I don't see anything relevant that has
changed since 6.1. I tried this only once, so it may have just been bad timing.
When the connections show up in CLOSE_WAIT state, the recv-q buffer always has
exactly 70 bytes.
{code}
netstat -tonp | grep CLOSE_WAIT | grep java
tcp 70 0 127.0.0.1:56538 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:47995 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:47477 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:47996 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:56644 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:56533 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)
...
{code}
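Output like the above can be tallied per remote endpoint and recv-q size to
watch how the CLOSE_WAIT count evolves between indexing runs. The following is
only a minimal sketch: the class name CloseWaitTally and its tally method are
hypothetical, and it assumes each socket occupies a single line in the column
order shown above (proto, recv-q, send-q, local, remote, state, ...).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Solr or HttpClient.
public class CloseWaitTally {

    /** Counts CLOSE_WAIT sockets, keyed by "remoteAddress recv-q=N". */
    public static Map<String, Integer> tally(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            String[] f = line.trim().split("\\s+");
            // Assumed columns: proto recv-q send-q local remote state ...
            if (f.length < 6 || !f[0].startsWith("tcp") || !"CLOSE_WAIT".equals(f[5])) {
                continue; // skip headers, "..." rows and other socket states
            }
            counts.merge(f[4] + " recv-q=" + f[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample rows copied from the netstat output above.
        List<String> sample = Arrays.asList(
            "tcp 70 0 127.0.0.1:56538 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)",
            "tcp 70 0 127.0.0.1:47995 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)",
            "tcp 70 0 127.0.0.1:56644 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)");
        tally(sample).forEach((key, n) -> System.out.println(key + " -> " + n));
    }
}
```

Running it periodically against fresh netstat output would make the "drops
during indexing, climbs afterwards" pattern easier to see than eyeballing the
raw listing.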
If I run the same steps with SSL disabled, the connections in CLOSE_WAIT
state have just 1 byte in recv-q. I don't see the number of such connections
increasing with indexing over time, but I know for a fact (from a client) that
eventually more and more connections pile up in this state even without SSL.
{code}
tcp 1 0 127.0.0.1:41723 127.0.1.1:8983 CLOSE_WAIT 2522/java off (0.00/0/0)
tcp 1 0 127.0.0.1:41780 127.0.1.1:8983 CLOSE_WAIT 2640/java off (0.00/0/0)
...
{code}
I enabled debug logging for PoolingHttpClientConnectionManager (used in 6.x)
and PoolingClientConnectionManager (used in 5.x.x). After running the
indexing program and verifying that some connections were in CLOSE_WAIT, I
grepped the logs for connections leased vs. released. The two counts are
always the same, which means that the connections are always given back to
the pool.
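The leased vs. released comparison can be sketched in a few lines. This is an
illustration only: the class LeaseReleaseCount is hypothetical, and the marker
strings "Connection leased" and "Connection released" are assumptions about
what the connection manager's debug output contains, so adjust them to
whatever your log actually prints. The sample lines in main are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for comparing lease and release counts in a debug log.
public class LeaseReleaseCount {

    // Assumed marker substrings; verify against the real log before relying on them.
    static final String LEASED = "Connection leased";
    static final String RELEASED = "Connection released";

    /** Returns a two-element array: {leased count, released count}. */
    public static long[] count(List<String> logLines) {
        long leased = 0;
        long released = 0;
        for (String line : logLines) {
            if (line.contains(LEASED)) {
                leased++;
            } else if (line.contains(RELEASED)) {
                released++;
            }
        }
        return new long[] {leased, released};
    }

    public static void main(String[] args) {
        // Made-up lines for illustration; real log lines will look different.
        List<String> sample = Arrays.asList(
            "DEBUG Connection leased: (details omitted)",
            "DEBUG Connection released: (details omitted)",
            "DEBUG some unrelated message");
        long[] c = count(sample);
        System.out.println("leased=" + c[0] + " released=" + c[1]
            + (c[0] == c[1] ? " (balanced)" : " (possible leak)"));
    }
}
```

Equal counts only show that every lease was returned to the pool; they say
nothing about whether the pooled sockets were later closed by the server.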
Now, some connections hanging around in CLOSE_WAIT are to be expected because
of the following (quoted from the HttpClient documentation):
{quote}
One of the major shortcomings of the classic blocking I/O model is that the
network socket can react to I/O events only when blocked in an I/O operation.
When a connection is released back to the manager, it can be kept alive however
it is unable to monitor the status of the socket and react to any I/O events.
If the connection gets closed on the server side, the client side connection is
unable to detect the change in the connection state (and react appropriately by
closing the socket on its end).
HttpClient tries to mitigate the problem by testing whether the connection is
'stale', that is no longer valid because it was closed on the server side,
prior to using the connection for executing an HTTP request. The stale
connection check is not 100% reliable. The only feasible solution that does not
involve a one thread per socket model for idle connections is a dedicated
monitor thread used to evict connections that are considered expired due to a
long period of inactivity. The monitor thread can periodically call
ClientConnectionManager#closeExpiredConnections() method to close all expired
connections and evict closed connections from the pool. It can also optionally
call ClientConnectionManager#closeIdleConnections() method to close all
connections that have been idle over a given period of time.
{quote}
I'm going to try adding such a monitor thread and see if this is still a
problem.
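For reference, a monitor thread of the kind the quoted documentation describes
might look roughly like the sketch below. It is not the change that was made
in Solr. The EvictablePool interface is a hypothetical stand-in so the example
compiles without HttpClient on the classpath; its two methods mirror
closeExpiredConnections() and closeIdleConnections(long, TimeUnit) on the
HttpClient connection managers, which could be wrapped in a small adapter. The
sweep interval and idle limit are arbitrary example values.

```java
import java.util.concurrent.TimeUnit;

// Sketch of an idle-connection monitor; names here are illustrative only.
public class IdleConnectionMonitor extends Thread {

    /** Hypothetical stand-in for the pooling connection manager. */
    public interface EvictablePool {
        void closeExpiredConnections();
        void closeIdleConnections(long idleTime, TimeUnit unit);
    }

    private final EvictablePool pool;
    private final long sweepIntervalMs; // must be > 0
    private final long maxIdleMs;
    private final Object lock = new Object();
    private volatile boolean shutdown;

    public IdleConnectionMonitor(EvictablePool pool, long sweepIntervalMs, long maxIdleMs) {
        super("idle-connection-monitor");
        this.pool = pool;
        this.sweepIntervalMs = sweepIntervalMs;
        this.maxIdleMs = maxIdleMs;
        setDaemon(true); // do not keep the JVM alive just for this thread
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (lock) {
                    lock.wait(sweepIntervalMs); // sleep until next sweep or shutdown
                }
                if (shutdown) {
                    break;
                }
                // Evict connections whose keep-alive has expired, then those
                // that have sat idle longer than maxIdleMs.
                pool.closeExpiredConnections();
                pool.closeIdleConnections(maxIdleMs, TimeUnit.MILLISECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // treat interruption as a stop request
        }
    }

    /** Asks the monitor to stop and wakes it if it is waiting. */
    public void shutdown() {
        shutdown = true;
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}
```

A caller would construct it once per connection manager, for example
new IdleConnectionMonitor(pool, 5000, 30000).start(), and call shutdown() when
the client is closed. Whether evicting idle connections this way actually
removes the CLOSE_WAIT build-up described above is exactly what remains to be
tested.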
> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> ------------------------------------------------------------------------------
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 5.5.1, 5.5.2
> Reporter: Anshum Gupta
> Priority: Critical
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT
> state.
> At my workplace, we have seen this issue only with 5.5.1 and could not
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of
> users with 5.3.1 running into this issue too.
> Here's an excerpt from the email [~shaie] sent to the mailing list about
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%[email protected]%3E
> Creating this issue so we could track this and have more people comment on
> what they see.