[
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464502#comment-16464502
]
Varun Thacker commented on SOLR-11881:
--------------------------------------
So recently I've been seeing this problem in this form:
* The replica gets a ReadPendingException from Jetty:
{code:java}
date time WARN [qtp768306356-580185] ? (:) -
java.nio.channels.ReadPendingException: null
    at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
{code}
* The leader keeps waiting until the socket timeout expires, then gets a
socket timeout exception and puts the replica into recovery.
So I took Tomás's latest patch and added SocketTimeoutException to the
{{isRetriableException}} check.
Q: Which exceptions should we retry on? Currently the patch retries on
SocketException / NoHttpResponseException.
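For illustration, here is a minimal sketch of what such a check could look like. It is not the code from the patch: the class name {{RetriableCheck}}, the method name {{isRetriable}}, the cause-chain walk, and the depth limit of 16 are all assumptions. NoHttpResponseException is matched by class name only so the sketch compiles without HttpClient on the classpath. SocketTimeoutException has to be listed separately because it extends InterruptedIOException rather than SocketException.

```java
import java.net.SocketException;
import java.net.SocketTimeoutException;

public class RetriableCheck {
    // Matched by name so this sketch has no compile-time HttpClient dependency.
    private static final String NO_HTTP_RESPONSE =
        "org.apache.http.NoHttpResponseException";

    /**
     * Hypothetical helper: returns true if t, or any exception in its
     * cause chain, is one of the transient network failures discussed here.
     */
    public static boolean isRetriable(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 16; depth++, t = t.getCause()) {
            if (t instanceof SocketException              // e.g. "Connection reset"
                    || t instanceof SocketTimeoutException // read timed out
                    || NO_HTTP_RESPONSE.equals(t.getClass().getName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped =
            new RuntimeException(new SocketTimeoutException("Read timed out"));
        System.out.println(isRetriable(wrapped));
        System.out.println(isRetriable(new IllegalStateException("bad request")));
    }
}
```

Note that ConnectException is a subclass of SocketException, so a check like this would also retry on refused connections; whether that is desirable is part of the question above.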
Once I added SocketTimeoutException as a retriable exception, I set the
socket timeout to 100ms and sent updates to manually test whether Solr retries
correctly. To my surprise, I was never able to hit a socket timeout exception.
After some debugging, here's why.
In ConcurrentUpdateSolrClient we do this:
{code:java}
org.apache.http.client.config.RequestConfig.Builder requestConfigBuilder =
    HttpClientUtil.createDefaultRequestConfigBuilder();
if (soTimeout != null) {
  requestConfigBuilder.setSocketTimeout(soTimeout);
}
if (connectionTimeout != null) {
  requestConfigBuilder.setConnectTimeout(connectionTimeout);
}
method.setConfig(requestConfigBuilder.build());
{code}
So {{createDefaultRequestConfigBuilder}} doesn't respect the timeout set in
solr.xml and uses a default of 60 seconds instead.
I debugged the code: if we simply remove these lines, the HttpClient uses
the default RequestConfig, which Solr creates from the settings specified in
the solr.xml file.
Mark: Do you remember the motivation for overriding the defaults from the
update shard handler's HttpClient and explicitly specifying a RequestConfig in
CUSC? Happy to track this in a separate Jira.
> Connection Reset Causing LIR
> ----------------------------
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch,
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update gets a connection reset like this, the leader
> will initiate LIR:
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX r:core_node56 collection_shardX_replicaY] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
>     at java.net.SocketInputStream.read(SocketInputStream.java:210)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>     at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>     at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
>     at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
>     at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
>     at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
>     at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
>     at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
>     at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
>     at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>     at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
>     at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931, Mark says: "On a heavy
> working SolrCloud cluster, even a rare response like this from a replica can
> cause a recovery and heavy cluster disruption."
> In SOLR-6931 we added an HTTP retry handler, but it only retries GET
> requests. Updates are POST requests (see
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}).
> Update requests between the leader and replica should be retryable since
> they have been versioned.