[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452907#comment-16452907
 ] 

Varun Thacker commented on SOLR-11881:
--------------------------------------

We ran into another such scenario where a SocketTimeoutException caused LIR.

The replica had this in it's logs
{code:java}
date time WARN [qtp768306356-580185] ? (:) - 
java.nio.channels.ReadPendingException: null
at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractConnection.fillInterested(AbstractConnection.java:134)
 ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) 
~[jetty-server-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
 ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:289) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.ssl.SslConnection$3.succeeded(SslConnection.java:149) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626) 
~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0-zing_17.11.0.0]

date time WARN [qtp768306356-580185] ? (:) - Read pending for 
org.eclipse.jetty.server.HttpConnection$BlockingReadCallback@2e98df28 prevented 
AC.ReadCB@424271f8{HttpConnection@424271f8[p=HttpParser{s=START,0 of 
-1},g=HttpGenerator@424273ae{s=START}]=>HttpChannelOverHttp@4242713d{r=141,c=false,a=IDLE,uri=null}<-DecryptedEndPoint@4242708d{/host:52824<->/host:port,OPEN,fill=FI,flush=-,to=1/86400}->HttpConnection@424271f8[p=HttpParser{s=START,0
 of -1},g=HttpGenerator@{code}
And the leader waited exactly the socket timeout period after this error and 
threw a socket-timeout-exception . At that point the leader put the replica 
into recovery 

> Connection Reset Causing LIR
> ----------------------------
>
>                 Key: SOLR-11881
>                 URL: https://issues.apache.org/jira/browse/SOLR-11881
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major
>         Attachments: SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:210)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>         at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
>         at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
>         at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
>         at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
>         at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
>         at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
>         at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
>         at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
>         at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
>         at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
>         at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>         at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
>         at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
>         at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to