When running a bulk index process, we occasionally see a NoHttpResponseException when the leader is forwarding docs to the replica. I think this is a known issue, and it can be reproduced pretty easily.
What makes me want to dig deeper is that a single NoHttpResponseException is enough for the leader to put the replica into recovery. The replica can then never catch up because the indexing throughput is quite high. Depending on how many documents are being indexed, this can add hours of recovery time for the replica.

As far as I can tell, we have two options here:

1. Implement a thread which removes stale connections. This has been discussed in the past on https://issues.apache.org/jira/browse/SOLR-4509
2. Accept that the above is not the right way forward. The main problem is that replicas can't catch up because Solr doesn't implement backpressure yet, and implementing that would be the correct solution here.

Does anyone have an opinion on how we should go forward with this issue?

--
Regards,
Varun Thacker
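
P.S. To make option 1 a bit more concrete, here is a minimal sketch of what a stale-connection evictor could look like. It is not Solr's or HttpClient's actual code: the IdleConnectionEvictor class, its method names, and the idle threshold are all hypothetical, and the "pool" is just a map of connection ids to last-used timestamps. A real implementation would close the underlying sockets, for example through the HttpClient connection manager's closeExpiredConnections() and closeIdleConnections(...) calls.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a connection pool; not a real Solr or HttpClient class. */
class IdleConnectionEvictor {
    // connection id -> last time (in ms) the connection was used
    private final Map<String, Long> lastUsedMillis = new ConcurrentHashMap<>();
    private final long maxIdleMillis;

    IdleConnectionEvictor(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Records that a connection was just used. */
    void markUsed(String connectionId, long nowMillis) {
        lastUsedMillis.put(connectionId, nowMillis);
    }

    /** Drops every connection idle for longer than maxIdleMillis; returns the number evicted. */
    int evictIdle(long nowMillis) {
        int evicted = 0;
        Iterator<Map.Entry<String, Long>> it = lastUsedMillis.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > maxIdleMillis) {
                it.remove(); // a real pool would also close the socket here
                evicted++;
            }
        }
        return evicted;
    }

    int size() {
        return lastUsedMillis.size();
    }

    /** Runs evictIdle periodically on a daemon thread; the caller shuts the scheduler down. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "idle-connection-evictor");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(
                () -> evictIdle(System.currentTimeMillis()),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

The idle threshold would need to be shorter than the server's keep-alive timeout for this to help, and it only narrows the window for a stale connection rather than closing it, which is part of why option 2 may still be needed.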
