Hi Jens,

On Fri, Jul 26, 2024 at 07:51:47PM +0200, Jens Wahnes wrote:
> Hi everyone,
>
> I'm trying to move from Haproxy 2.4 to 2.8 and encountered some trouble with
> 2.8 that did not occur on 2.4.
> Specifically, this seems to concern plain HTTP connections only, i.e.
> non-HTTPS traffic. I have not seen an example of this happening with HTTPS
> connections. However, it could also be that it is specific to HTTP/1 as
> opposed to HTTP/2 (we do not have any unencrypted HTTP/2 connections).
>
> Some of the backend connections are not properly closed with Haproxy 2.8 and
> keep lingering for hours or days. That is, the client connection is gone,
> but the backend connection is still there in state CLOSE_WAIT. This in turn
> causes trouble with seamless reloads: the old Haproxy processes never
> terminate and as several newer ones pile up, they take up memory.
>
> I have not experienced this with Haproxy 2.4.27. On 2.4.27, these backend
> TCP connections were properly closed. It only began to be a problem with
> newer versions like 2.8.10.
>
> To give an example, a connection stuck in CLOSE_WAIT may look like this in
> `ss` output:
>
> ```
> CLOSE-WAIT 0 0 172.16.240.53:48528
> 172.16.240.86:2000 users:(("haproxy",pid=1099145,fd=239))
> ```
>
> In Haproxy's own `show sess`, the above example would look like this:
>
> ```
> 0x7efc6b7b5400: [26/Jul/2024:18:24:50.308214] id=237988 proto=tcpv4
> source=10.22.113.130:50205
> flags=0x53c0a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000
> srv_conn=(nil), pend_pos=(nil) waiting=0 epoch=0
> frontend=Loadbalancer (id=48 mode=http), listener=http (id=5)
> addr=10.210.18.56:80
> backend=_std_ (id=3 mode=http) addr=172.16.240.53:48528
> server=hermes (id=1) addr=172.16.240.86:2000
> task=0x7efc72205ec0 (state=0x00 nice=0 calls=6 rate=0 exp=<NEVER>
> tid=1(1/1) age=17m41s)
> txn=0x7efc6bf72aa0 flags=0x0 meth=1 status=403 req.st=MSG_CLOSED
> rsp.st=MSG_CLOSING req.f=0x44 rsp.f=0x0d
> scf=0x7efc7233e8a0 flags=0x00030406 state=CLO
> endp=CONN,0x7efc6d080ec0,0x04014a01 sub=0 rex=<NEVER> wex=<NEVER>
> h1s=0x7efc6d080ec0 h1s.flg=0x104010 .sd.flg=0x4014a01
> .req.state=MSG_DONE .res.state=MSG_DATA
> .meth=GET status=403 .sd.flg=0x04014a01 .sc.flg=0x00030406
> .sc.app=0x7efc6b7b5400 .subs=(nil)
> h1c=0x7efc6d5f5d40 h1c.flg=0x20d00 .sub=0 .ibuf=0@(nil)+0/0
> .obuf=0@(nil)+0/0 .task=0x7efc6d004360 .exp=<NEVER>
> co0=0x7efc7230f140 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM
> target=LISTENER:0x7efc712d5e00
> flags=0x801c0300 fd=194 fd.state=1c22 updt=0 fd.tmask=0x2
> scb=0x7efc6b7c85e0 flags=0x0003501b state=CLO
> endp=CONN,0x7efc6d095bc0,0x05004a01 sub=0 rex=<NEVER> wex=<NEVER>
> h1s=0x7efc6d095bc0 h1s.flg=0x4010 .sd.flg=0x5004a01
> .req.state=MSG_DONE .res.state=MSG_DONE
> .meth=GET status=403 .sd.flg=0x05004a01 .sc.flg=0x0003501b
> .sc.app=0x7efc6b7b5400 .subs=(nil)
> h1c=0x7efc6d087280 h1c.flg=0x80000100 .sub=0 .ibuf=0@(nil)+0/0
> .obuf=0@(nil)+0/0 .task=0x7efc6d052500 .exp=<NEVER>
> co1=0x7efc723ede00 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM
> target=SERVER:0x7efc71e48000
> flags=0x00040300 fd=239 fd.state=1122 updt=0 fd.tmask=0x2
> req=0x7efc6b7b5420 (f=0x848000 an=0x40000 pipe=0 tofwd=0 total=113)
> an_exp=<NEVER> buf=0x7efc6b7b5428 data=(nil) o=0 p=0 i=0 size=0
> htx=0x8c65c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
> res=0x7efc6b7b5470 (f=0xa0240000 an=0x24000000 pipe=2310 tofwd=0
> total=25601)
> an_exp=<NEVER> buf=0x7efc6b7b5478 data=(nil) o=0 p=0 i=0 size=0
> htx=0x8c65c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
> ```
>
> Issuing a `shutdown session` command does not resolve the issue (e.g.
> `shutdown session 0x7efc6b7b5400` in this example). So far, killing the
> whole process has been the only way for me to get rid of these leftover
> connections.
First, thanks a lot for all the details. I'm seeing a non-zero "pipe"
value above, which indicates you're using splicing between the two sides.
Could you please comment out whatever "option splice*" directives you have
in your config so that we can confirm whether it's related or not? (There
is an illustrative snippet at the end of this mail showing the kind of
lines I mean.)

> So my question is: does this ring a bell for anyone? Maybe something that
> has been fixed in newer versions (like 3.0)?

No, it really doesn't ring any bell right now. In any case, if such a bug
had been identified, the fix would have been backported to the affected
versions, and I'm not seeing anything related post-2.8 either. However, I
am seeing a few post-2.9 changes related to splicing, so it is possible
that it was accidentally fixed. It would therefore be very useful if you
could test 2.9 or 3.0, as this would help us figure out whether the problem
is specific to 2.8 or affects every version starting from 2.8.

Also, if you want to experiment with closing such FDs, you can spot the FD
number after the second "fd=" in the "show sess" output (the backend side)
and forcefully close it by issuing:

    expert-mode on; debug dev close XXX

where XXX is the FD number. Then issue "show sess" again on the faulty
stream to see whether it's gone or whether it changed; if it is still
there, its new flags may also help pinpoint the problem. (A concrete
example of the whole sequence is sketched at the very end of this mail.)
Just be aware that forcefully closing an FD like this may trigger other
problems, e.g. the CPU spinning at 100% while trying to report an
unhandled event. I don't think this should scare you, given the approaches
you have already tried, but I preferred to let you know ;-)

Thanks,
Willy
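P.S. To make the splicing question concrete, these are the kinds of
directives I mean. This is only an illustrative snippet (the section and
its contents are made up for the example), so please adapt it to whatever
your real configuration contains:

```
# Hypothetical excerpt: comment out whichever of these splice options
# actually appear in your own defaults/frontend/backend sections.
defaults
    mode http
    #option splice-auto        # let haproxy decide when to use splice()
    #option splice-request     # splice data forwarded from client to server
    #option splice-response    # splice data forwarded from server to client
```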
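P.P.S. And a sketch of the FD-closing experiment, assuming you can reach
the CLI of the process that still holds the session, that its stats socket
is exposed at /var/run/haproxy.sock (adjust the path to your setup) and
that socat is available. The FD number 239 and the stream pointer are taken
from the "show sess" output you posted:

```
# Enable expert mode and forcefully close the backend-side FD. Both commands
# must be sent in the same CLI session, hence the single semicolon-separated line.
echo "expert-mode on; debug dev close 239" | socat stdio /var/run/haproxy.sock

# Re-dump the faulty stream to see whether it is gone or whether its flags changed.
echo "show sess 0x7efc6b7b5400" | socat stdio /var/run/haproxy.sock
```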