On 4/25/2018 1:29 AM, Lukas Tribus wrote:
> You seem to be able to reproduce this easily, so please share the logs
> when this happens including the requests (don't use dontlognull), so
> that we can see the server up/down events and all the successful
> and failing requests together with timestamps, return status and
> return codes.
I can't mess too much with the setup where I saw the problem, because
it's actively being used for some critical work right now, but I do have
another install where I'm using the backup keyword for Solr servers
behind haproxy.
backend be_sp
    description Solr backend for the Spark index.
    option httpchk GET /solr/sparkmain/admin/ping
    balance leastconn
    timeout check 4990
    server idxa6 10.100.0.250:8981 check inter 5s fastinter 2s rise 3 fall 2 weight 100
    server idxb6 10.100.0.251:8981 check inter 5s fastinter 2s rise 3 fall 2 weight 100 backup
    server idxa3 10.100.0.244:8981 check inter 15s fastinter 2s rise 2 fall 1 weight 30 backup
    server idxb3 10.100.0.245:8981 check inter 15s fastinter 2s rise 2 fall 1 weight 20 backup
    server bigindy5 10.100.1.39:8982 check inter 15s fastinter 2s rise 2 fall 1 weight 10 backup
I tried a similar experiment on that setup, and couldn't see the same
behavior. With a loop sending requests using curl every two seconds, I
only got one 503 "no server available" response. Here are some logs from
haproxy:
Apr 25 10:10:32 localhost haproxy[20272]: 10.2.0.48:48065 [25/Apr/2018:10:10:31.358] fe_sp_8986 be_sp/idxa6 0/1001/-1/-1/1002 503 212 - - SC-- 5/0/0/0/1 0/0 "GET /solr/sparkmain/admin/ping?echoParams=none&shards.info=false HTTP/1.1"
Apr 25 10:10:32 localhost haproxy[20272]: Server be_sp/idxa6 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 4 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Apr 25 10:10:34 localhost haproxy[20272]: 10.2.0.48:48066 [25/Apr/2018:10:10:34.384] fe_sp_8986 be_sp/idxb6 0/0/0/14/14 200 371 - - ---- 5/1/0/1/0 0/0 "GET /solr/sparkmain/admin/ping?echoParams=none&shards.info=false HTTP/1.1"
And here are the two responses corresponding to those logs:
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">13</int></lst><str name="status">OK</str>
</response>
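For anyone comparing many such entries, here is a minimal Python sketch that splits the two request log lines above into named fields. The field names (Tq/Tw/Tc/Tr/Tt, actconn/feconn/beconn/srv_conn/retries) follow the haproxy documentation for the HTTP log format; the regular expression and the parse_line helper are assumptions written only for lines shaped like these, not a general haproxy log parser.

```python
import re

# Assumed pattern for haproxy 1.5 HTTP-format log lines with a syslog prefix.
# It only targets request lines like the two quoted in this message.
LOG_RE = re.compile(
    r'haproxy\[(?P<pid>\d+)\]: '
    r'(?P<client>\S+) \[(?P<accept_date>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<Tq>-?\d+)/(?P<Tw>-?\d+)/(?P<Tc>-?\d+)/(?P<Tr>-?\d+)/(?P<Tt>\+?\d+) '
    r'(?P<status>-?\d+) (?P<bytes>\+?\d+) \S+ \S+ '
    r'(?P<term_state>\S{4}) '
    r'(?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\+?\d+) '
    r'(?P<srv_queue>\d+)/(?P<backend_queue>\d+) '
    r'"(?P<request>[^"]*)"'
)

INT_FIELDS = ("pid", "Tq", "Tw", "Tc", "Tr", "Tt", "status", "bytes",
              "actconn", "feconn", "beconn", "srv_conn", "retries",
              "srv_queue", "backend_queue")


def parse_line(line):
    """Return a dict of log fields, or None if the line is not a request log."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


REQUEST = ('"GET /solr/sparkmain/admin/ping?echoParams=none&shards.info=false '
           'HTTP/1.1"')
FAILED = ('Apr 25 10:10:32 localhost haproxy[20272]: 10.2.0.48:48065 '
          '[25/Apr/2018:10:10:31.358] fe_sp_8986 be_sp/idxa6 '
          '0/1001/-1/-1/1002 503 212 - - SC-- 5/0/0/0/1 0/0 ' + REQUEST)
OK = ('Apr 25 10:10:34 localhost haproxy[20272]: 10.2.0.48:48066 '
      '[25/Apr/2018:10:10:34.384] fe_sp_8986 be_sp/idxb6 '
      '0/0/0/14/14 200 371 - - ---- 5/1/0/1/0 0/0 ' + REQUEST)

for entry in (FAILED, OK):
    f = parse_line(entry)
    # Tc == -1 means the connection to the server was never established.
    print(f["server"], f["status"], f["term_state"],
          "connect failed" if f["Tc"] == -1 else "connected",
          "retries=%d" % f["retries"])
```

In the failed entry, Tc is -1 and the termination state starts with "SC", which the haproxy documentation describes as the server (or something between it and haproxy) explicitly refusing the TCP connection. That matches the "Connection refused" reason in the health check line logged at the same second.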
What I just saw with this is more in line with the behavior I'm after.
It would be nice if I didn't get any request failures at all, but having
failures happen for a very short time isn't a problem.
When I did this before on the planet/hollywood backend, I got a whole
bunch of "no server available" responses in a row from the curl-based
script I was running. The install where I saw the problem is running
haproxy 1.5.12; this one, where things seem to work right, is running
1.5.14.
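To make the test repeatable on both installs, the polling loop described above could be sketched in Python roughly as follows. This is not the original curl-based script; the poll, fetch_status, and count_failures names are hypothetical, and any URL passed in (host and port in particular) is an assumption about where the frontend listens.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status for a GET, or None on a connection-level error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as haproxy's 503 arrive here.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def poll(url, count, interval=2.0, fetch=fetch_status, sleep=time.sleep):
    """Send `count` requests, `interval` seconds apart, recording each status."""
    results = []
    for i in range(count):
        results.append((time.strftime("%H:%M:%S"), fetch(url)))
        if i < count - 1:
            sleep(interval)
    return results


def count_failures(results):
    """Count responses that were anything other than HTTP 200."""
    return sum(1 for _, status in results if status != 200)
```

Calling poll() with the frontend URL (for example a hypothetical http://localhost:8986/solr/sparkmain/admin/ping?echoParams=none&shards.info=false) while stopping the primary server, then passing the result to count_failures(), would give one number per install that can be compared between 1.5.12 and 1.5.14.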
Thanks,
Shawn