Hi, Thanks Willy for looking into this !
On Tue, Oct 08, Willy Tarreau wrote: > On Fri, Oct 04, 2019 at 07:28:15PM +0300, Jarno Huuskonen wrote: > > I sent pcap/strace offlist. > > Thanks, that was very useful. > > > (strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and > > port 8080). > > > > I think in packet capture the second health checks causes > > "client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset by > > peer [plugins/http/http.c line 917]" > > (I think uswgi logs client_port incorrectly, ntohs(2779) gives 56074 > > (and port 56074 is in packet capture)). > > It's as usual, when you start to troubleshoot a bug you find a myriad of > other ones around :-) > > > (haproxy version: HA-Proxy version 2.1-dev2 2019/10/01). > > > > I tried to reproduce with very minimal flask/uwsgi hello world app > > and there hr_read happens very rarely. > > With alerta(.io) app this happens more regularly (AFAIK not with every > > check). > > So maybe this is weird timing issue or bug in uwsgi. > > So it is indeed a timing issue. The check running on port 56062 shows the > following sequence: > > 17:10:57.877876 IP localhost.56062 > localhost.8080: Flags [S] > 17:10:57.877898 IP localhost.8080 > localhost.56062: Flags [S.] > 17:10:57.877909 IP localhost.56062 > localhost.8080: Flags [.] > 17:10:57.878065 IP localhost.56062 > localhost.8080: Flags [P.] > 17:10:57.878078 IP localhost.8080 > localhost.56062: Flags [.] > 17:10:57.879933 IP localhost.8080 > localhost.56062: Flags [P.] > 17:10:57.879939 IP localhost.56062 > localhost.8080: Flags [.] > 17:10:57.880008 IP localhost.8080 > localhost.56062: Flags [F.] > 17:10:57.880333 IP localhost.56062 > localhost.8080: Flags [F.] > 17:10:57.880341 IP localhost.8080 > localhost.56062: Flags [.] > > Note the FIN sent 75 microseconds after the response. This resulted in > recvfrom() returning zero and the connection to be cleanly closed. Now > regarding port 56074 that is causing excess logs : > > 17:11:04.132867 IP localhost.56074 > localhost.8080: Flags [S] > 17:11:04.132890 IP localhost.8080 > localhost.56074: Flags [S.] > 17:11:04.132904 IP localhost.56074 > localhost.8080: Flags [.] > 17:11:04.133083 IP localhost.56074 > localhost.8080: Flags [P.] > 17:11:04.133098 IP localhost.8080 > localhost.56074: Flags [.] > 17:11:04.135101 IP localhost.8080 > localhost.56074: Flags [P.] > 17:11:04.135107 IP localhost.56074 > localhost.8080: Flags [.] > 17:11:04.135316 IP localhost.56074 > localhost.8080: Flags [R.] > > As you can see, even 215 microseconds after the response there is still > no FIN. I've checked and in both cases the headers are the same. There > is indeed a "connection: close" header emitted by haproxy, none is > advertised in the response, but clearly it's a matter of delay. And > this causes recvfrom() to return EAGAIN, so haproxy is forced to hard- > close the connection (as if it were doing option http-server-close). > Please also note that returned data are properly ACKed so there is not > much more we can do there. > > Before the patch that you bisected, you didn't face this because of a > bug that was preventing the hard close from working, and instead you > were accumulating TIME_WAIT sockets on the client side. So I'm afraid > to say that for now the best I can say is that you'll still have to > live with these occasional logs :-/ Thanks. Glad to see this wasn't a bug on haproxy side. I tested the alerta app with gunicorn and gunicorn doesn't seem to mind that haproxy uses forced close on the health check, so switching uwsgi->gunicorn is another option. > I'd really like to see a massive rework being done on the checks. Your > report made me realize that there is one feature we're currently missing, > which is the ability to wait for a close once we've got a response. At > the moment we cannot add this because checks work in 2 steps : > > 1) connect and send the "request" > 2) receive the "reponse" and close. And AFAIK this works just fine with most sane backend servers. > There's no real state machine there, it's roughly "if request was not > sent, send it, otherwise try to receive a response; if response is > received, then process it and close otherwise wait for response". So > we cannot remain in a "waiting for closing" state after we receive a > response. I'm wondering if that couldn't be achieved using tcp-checks > however. I quickly tried but couldn't find a reliable way of doing this > but I'm thing that we could possibly extend the tcp-check rules to have > "tcp-check expect close" that would wait for a read0. Should there be a somekind of timeout how long to wait for close ? I'm thinking about backend servers that have very long/infinite keepalive (and/or don't respect Connection: close). -Jarno -- Jarno Huuskonen

