Hi All,
We recently upgraded from Solr 6.6.6 and Java 8 to Solr 7.7.3 and Java 11,
and have started seeing replication failures that leave replicas in an
inconsistent state with no self-correction mechanism. The leader hits a
broken pipe SocketException like this:

ERROR org.apache.solr.update.SolrCmdDistributor - FROMLEADER request to http://test-solr-8:8983/solr/instance_194563/ failed - retrying ... retries: 1/3. add{,id=(null)} params:update.chain=external-version-constraint&update.distrib=FROMLEADER&distrib.from=http://test-solr-10.terravault.com:8983/solr/instance_194563/ rsp:-1:java.net.SocketException: Broken pipe (Write failed)
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
        at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
        at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:167)
        at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:122)
        at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:179)
        at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216)
        at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209)
        at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:169)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:102)
        at org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner$1.writeTo(ConcurrentUpdateSolrClient.java:266)
        at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:73)
        at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156)
        at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
        at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:349)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:183)
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
        Suppressed: java.net.SocketException: Broken pipe (Write failed)
                at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
                at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
                at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
                at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
                at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
                at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:167)
                at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:122)
                at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:179)
                at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216)
                at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209)
                at org.apache.solr.common.util.JavaBinCodec.close(JavaBinCodec.java:1299)
                at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:103)
                ... 22 more

On Solr 6 we would see this on occasion, but it was handled by the leader
forcing a leader-initiated recovery (LIR) of the replica. That is no longer
happening. Since our upgrade to Java 11, the broken pipe also seems to occur
much more frequently. We're currently investigating an upgrade to Java 17 to
see if it reduces the frequency, but I'm pretty sure we will still hit it
occasionally. The lack of LIR in this scenario is concerning because it
means replicas can become out of sync and remain out of sync indefinitely.
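
To make the out-of-sync condition concrete, here is a minimal sketch of how
divergence could be spotted and a recovery requested by hand. It is not
something Solr does for us. It assumes the URLs from the log above are the
replica core URLs, that the core name matches the collection name
(instance_194563), and that indexing is quiet and committed when the counts
are taken. The function names are hypothetical. Equal counts do not prove
identical content, so this only catches the obvious cases.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query that is answered by this replica only."""
    # distrib=false stops Solr from forwarding the query to other replicas.
    query = urlencode({"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"})
    return core_url.rstrip("/") + "/select?" + query


def fetch_count(core_url, timeout=10):
    """Return numFound as seen by this replica (performs an HTTP request)."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def lagging_replicas(counts, leader):
    """Given {core_url: numFound}, list replicas that disagree with the leader."""
    expected = counts[leader]
    return sorted(url for url, n in counts.items()
                  if url != leader and n != expected)


def recovery_url(node_url, core_name):
    """Build a CoreAdmin REQUESTRECOVERY call for one core on one node."""
    query = urlencode({"action": "REQUESTRECOVERY", "core": core_name})
    return node_url.rstrip("/") + "/solr/admin/cores?" + query
```

In use, one would call fetch_count() for the leader and each replica, pass
the resulting dict to lagging_replicas(), and issue a GET against
recovery_url() for each replica it returns. That is a stopgap at best; what
we are really after is the leader doing this itself, as it did on Solr 6.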

Current Configuration:
Solr 7.7.3
JVM: Amazon Corretto 11.0.19.7.1
OS: Amazon Linux 2
SolrCloud with 2 replicas for every collection and a single shard per
collection. We have a custom sharding implementation, built on top of Solr
collections, that lives outside of Solr.

There seem to be multiple open issues that look related to this:
"Multiple flaws in tracking which UpdateCommand is associated with a given
failure logged by ErrorReportingConcurrentUpdateSolrClient:
'cmd=add{,id=(null)}'": https://issues.apache.org/jira/browse/SOLR-14718
"ConcurrentUpdateSolrClient swallows exceptions":
https://issues.apache.org/jira/browse/SOLR-3284


Has anyone run into a similar issue?  Are there any known solutions?

Thanks,
Brian


-- 


*Brian Lininger*
Technical Architect, Infrastructure & Search
*Veeva Systems *