Hi All,

We recently upgraded from Solr 6.6.6 and Java 8 to Solr 7.7.3 and Java 11, and have started seeing replication failures that leave replicas in an inconsistent state with no self-correction mechanism. The leader hits a broken pipe SocketException like this:

ERROR org.apache.solr.update.SolrCmdDistributor - FROMLEADER request to http://test-solr-8:8983/solr/instance_194563/ failed - retrying ... retries: 1/3. add{,id=(null)} params:update.chain=external-version-constraint&update.distrib=FROMLEADER&distrib.from=http://test-solr-10.terravault.com:8983/solr/instance_194563/ rsp:-1:java.net.SocketException: Broken pipe (Write failed)
    at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
    at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
    at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
    at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
    at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:167)
    at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:122)
    at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:179)
    at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216)
    at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209)
    at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:169)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:102)
    at org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner$1.writeTo(ConcurrentUpdateSolrClient.java:266)
    at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:73)
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156)
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
    at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:349)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:183)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    Suppressed: java.net.SocketException: Broken pipe (Write failed)
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
        at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
        at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:167)
        at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:122)
        at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:179)
        at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216)
        at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209)
        at org.apache.solr.common.util.JavaBinCodec.close(JavaBinCodec.java:1299)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:103)
        ... 22 more

On Solr 6 we would see this on occasion, but the leader handled it by forcing a recovery (LIR) of the replica. That is no longer happening. After our upgrade to Java 11, the broken pipe seems to be happening much more frequently. We are currently investigating an upgrade to Java 17 to see whether it reduces the frequency, but I'm pretty sure that we will still hit this occasionally. The lack of LIR in this scenario is concerning because it means that replicas can become out of sync and remain out of sync indefinitely.

Current configuration:
- Solr: 7.7.3
- JVM: Amazon Corretto 11.0.19.7.1
- OS: Amazon Linux 2
- SolrCloud with 2 replicas for every collection
- A single shard per collection; we have a custom sharding implementation built on top of Solr collections, outside of Solr

There seem to be multiple open issues that look related to this:
- Multiple flaws in tracking which UpdateCommand is associated with a given failure logged by ErrorReportingConcurrentUpdateSolrClient: "cmd=add{,id=(null)}"
  https://issues.apache.org/jira/browse/SOLR-14718
- ConcurrentUpdateSolrClient swallows exceptions
  https://issues.apache.org/jira/browse/SOLR-3284

Has anyone run into a similar issue? Are there any known solutions?

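
To make the out-of-sync condition concrete: one rough way to spot a diverged replica is to ask each core for its local document count with distrib=false and compare the results. Below is a minimal sketch, assuming Python with only the standard library. The function names are hypothetical, and matching counts do not prove the replicas are identical (counts can also differ briefly while commits are in flight), so treat it as a coarse check rather than a definitive one.

```python
import json
import urllib.parse
import urllib.request


def parse_num_found(body):
    """Extract response.numFound from a Solr JSON select response."""
    return int(json.loads(body)["response"]["numFound"])


def num_found(core_url, timeout=10.0):
    """Return the local document count of one core.

    distrib=false keeps the query on the core that receives it instead of
    letting SolrCloud route it to another replica of the shard.
    """
    params = urllib.parse.urlencode(
        {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    )
    url = core_url.rstrip("/") + "/select?" + params
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_num_found(resp.read().decode("utf-8"))


def diverged(counts):
    """True when the cores do not all report the same count."""
    return len(set(counts.values())) > 1


def check(core_urls, fetch=num_found):
    """Collect per-core counts and report whether they disagree.

    `fetch` is injectable so the comparison can be exercised without a
    running Solr node.
    """
    counts = {url: fetch(url) for url in core_urls}
    return counts, diverged(counts)
```

Calling check() with the two core URLs from the log above (http://test-solr-8:8983/solr/instance_194563 and http://test-solr-10.terravault.com:8983/solr/instance_194563) would return the per-core counts and a flag that is True when they differ.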
Thanks,
Brian

--
Brian Lininger
Technical Architect, Infrastructure & Search
Veeva Systems
brian.linin...@veeva.com