Hi Willy,

Thanks so much for looking into this. I certainly didn't expect an answer over the weekend and I really appreciate you taking the time. Please find my comments below.

Willy Tarreau wrote:
On Fri, Jul 26, 2024 at 07:51:47PM +0200, Jens Wahnes wrote:
Some of the backend connections are not properly closed with Haproxy 2.8 and
keep lingering for hours or days. That is, the client connection is gone,
but the backend connection is still there in state CLOSE_WAIT. This in turn
causes trouble with seamless reloads: the old Haproxy processes never terminate, and as more of them pile up with each reload, they take up memory.

To give an example, a connection stuck in CLOSE_WAIT may look like this in `ss` output:

```
CLOSE-WAIT 0      0                          172.16.240.53:48528       172.16.240.86:2000 users:(("haproxy",pid=1099145,fd=239))
```

In Haproxy's own `show sess`, the above example would look like this:

```
0x7efc6b7b5400: [26/Jul/2024:18:24:50.308214] id=237988 proto=tcpv4 source=10.22.113.130:50205
   flags=0x53c0a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=(nil), pend_pos=(nil) waiting=0 epoch=0
   frontend=Loadbalancer (id=48 mode=http), listener=http (id=5) addr=10.210.18.56:80
   backend=_std_ (id=3 mode=http) addr=172.16.240.53:48528
   server=hermes (id=1) addr=172.16.240.86:2000
   task=0x7efc72205ec0 (state=0x00 nice=0 calls=6 rate=0 exp=<NEVER> tid=1(1/1) age=17m41s)
   txn=0x7efc6bf72aa0 flags=0x0 meth=1 status=403 req.st=MSG_CLOSED rsp.st=MSG_CLOSING req.f=0x44 rsp.f=0x0d
   scf=0x7efc7233e8a0 flags=0x00030406 state=CLO endp=CONN,0x7efc6d080ec0,0x04014a01 sub=0 rex=<NEVER> wex=<NEVER>
       h1s=0x7efc6d080ec0 h1s.flg=0x104010 .sd.flg=0x4014a01 .req.state=MSG_DONE .res.state=MSG_DATA
       .meth=GET status=403 .sd.flg=0x04014a01 .sc.flg=0x00030406 .sc.app=0x7efc6b7b5400 .subs=(nil)
       h1c=0x7efc6d5f5d40 h1c.flg=0x20d00 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7efc6d004360 .exp=<NEVER>
       co0=0x7efc7230f140 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=LISTENER:0x7efc712d5e00
       flags=0x801c0300 fd=194 fd.state=1c22 updt=0 fd.tmask=0x2
   scb=0x7efc6b7c85e0 flags=0x0003501b state=CLO endp=CONN,0x7efc6d095bc0,0x05004a01 sub=0 rex=<NEVER> wex=<NEVER>
       h1s=0x7efc6d095bc0 h1s.flg=0x4010 .sd.flg=0x5004a01 .req.state=MSG_DONE .res.state=MSG_DONE
       .meth=GET status=403 .sd.flg=0x05004a01 .sc.flg=0x0003501b .sc.app=0x7efc6b7b5400 .subs=(nil)
       h1c=0x7efc6d087280 h1c.flg=0x80000100 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7efc6d052500 .exp=<NEVER>
       co1=0x7efc723ede00 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x7efc71e48000
       flags=0x00040300 fd=239 fd.state=1122 updt=0 fd.tmask=0x2
   req=0x7efc6b7b5420 (f=0x848000 an=0x40000 pipe=0 tofwd=0 total=113)
       an_exp=<NEVER> buf=0x7efc6b7b5428 data=(nil) o=0 p=0 i=0 size=0
       htx=0x8c65c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
   res=0x7efc6b7b5470 (f=0xa0240000 an=0x24000000 pipe=2310 tofwd=0 total=25601)
       an_exp=<NEVER> buf=0x7efc6b7b5478 data=(nil) o=0 p=0 i=0 size=0
       htx=0x8c65c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
```

Issuing a `shutdown session` command does not resolve the issue (e.g.
`shutdown session 0x7efc6b7b5400` in this example). So far, killing the
whole process has been the only way for me to get rid of these leftover
connections.

First, thanks a lot for all the details. I'm seeing a non-zero "pipe"
value above, indicating you're using splicing between the two sides.
Could you please comment out whatever "option splice*" you have in your config so that we can confirm whether it's related or not?

Yes, there are splice options in the frontend configuration:

```
option splice-response
option splice-request
```
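
For context, the relevant part of the frontend looks roughly like this; the frontend/backend names and the listener address are the ones from the `show sess` output above, while the rest is simplified for illustration:

```
frontend Loadbalancer
    bind 10.210.18.56:80
    mode http
    # the two options in question; commenting them out disables kernel splicing
    option splice-request
    option splice-response
    # backend selection simplified for this sketch
    default_backend _std_
```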

However, I'm seeing a few post-2.9 changes related to splicing, so it may be possible that it has accidentally been fixed. As such, it would be very useful if you could test 2.9 or 3.0, as this could help us figure out whether it is 2.8-specific only or affects any version starting from 2.8.

Thanks for clarifying.

For now, I'm running the same config with 2.8.10, just without "splice-response" and "splice-request", to see if the problem still appears. If no new sockets in CLOSE_WAIT state show up with this setup, that should narrow things down to an issue related to splicing.

So if that proves successful, I'll try version 3.0.3 with splicing enabled, probably next week.
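
In case it is useful, this is roughly how I'm watching for new stuck sockets while that test runs; the backend port 2000 is the one from the example above, so the filter would need adjusting for other setups:

```
# haproxy-owned backend sockets stuck in CLOSE_WAIT towards port 2000
ss -tnp state close-wait '( dport = :2000 )' | grep haproxy
```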

Also, if you want to experiment with ways to close such FDs, you can spot the FD number after the second "fd=" in "show sess" (backend side), and forcefully close it by issuing:

   expert-mode on; debug dev close XXX

where XXX is the FD number. Then issue "show sess" again on the faulty stream to see if it's gone or if it changed. If it's still there, its new flags can also help figure out the problem. Just be aware that forcefully closing an FD like this may trigger problems, e.g. CPU spinning at 100% trying to report an unhandled event or something like that. I don't think it's something that should scare you given that you've already tried similar approaches, but I preferred to let you know ;-)

That's a great suggestion, thank you. I was not aware of this special command.

To limit any potential harm, I first disabled the splicing feature in the config file, then reloaded Haproxy and waited until any "real" connections to the old process were gone. I then lowered the old processes' CPU priority to make sure they wouldn't hurt overall performance should one of them go wild. With these precautions in place, I went on to close the open FDs of the backend sessions. Doing it this safe way, by sending commands to the old process, was of course only possible thanks to your fabulous master socket feature.
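
In case it helps anyone reading along later, the procedure against the old worker looked roughly like this: connect to the master CLI, switch the session to the old worker by its PID, enable expert mode, and close the offending FD. The socket path below is specific to my setup, and the PID (1099145) and FD (239) are the ones from the example above; everything after the socat line is typed at the master CLI prompt:

```
socat readline UNIX-CONNECT:/var/run/haproxy-master.sock
show proc
@!1099145
expert-mode on
debug dev close 239
```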

The good news is that nothing strange happened when I sent "close" commands for all FDs in CLOSE_WAIT state (some 20 connections since Friday evening). So there was no process going berserk and eating up CPU or anything like that. The bad news is that it didn't help much either. While the lingering backend connections are in fact gone (they no longer show up in `ss`), the sessions in the old Haproxy process remain, and a `shutdown session` command on them still has no effect. This also means that the old process is not terminating on its own.
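
For completeness, the shutdown attempt itself was just a one-shot command routed through the master socket; again, the socket path, PID and stream pointer are the ones from my setup and the example above:

```
echo "@!1099145 shutdown session 0x7efc6b7b5400" | socat stdio UNIX-CONNECT:/var/run/haproxy-master.sock
```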

The example session that I mentioned before now looks like this after the FD close and `shutdown session` treatment:

```
0x7efc6b7b5400: [26/Jul/2024:18:24:50.308214] id=237988 proto=tcpv4 source=10.22.113.130:50205
   flags=0x53c0a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=(nil), pend_pos=(nil) waiting=0 epoch=0
   frontend=Loadbalancer (id=48 mode=http), listener=http (id=5) addr=10.210.18.56:80
   backend=_std_ (id=3 mode=http) addr=172.16.240.53:48528
   server=hermes (id=1) addr=172.16.240.86:2000
   task=0x7efc72205ec0 (state=0x00 nice=0 calls=59 rate=0 exp=<NEVER> tid=1(1/1) age=20h31m)
   txn=0x7efc6bf72aa0 flags=0x0 meth=1 status=403 req.st=MSG_CLOSED rsp.st=MSG_CLOSING req.f=0x44 rsp.f=0x0d
   scf=0x7efc7233e8a0 flags=0x00030406 state=CLO endp=CONN,0x7efc6d080ec0,0x0401ca01 sub=0 rex=<NEVER> wex=<NEVER>
       h1s=0x7efc6d080ec0 h1s.flg=0x104010 .sd.flg=0x401ca01 .req.state=MSG_DONE .res.state=MSG_DATA
       .meth=GET status=403 .sd.flg=0x0401ca01 .sc.flg=0x00030406 .sc.app=0x7efc6b7b5400 .subs=(nil)
       h1c=0x7efc6d5f5d40 h1c.flg=0x20d00 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7efc6d004360 .exp=27s
       co0=0x7efc7230f140 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=LISTENER:0x7efc712d5e00
       flags=0x801c0300 fd=194 fd.state=1c22 updt=0 fd.tmask=0x2
   scb=0x7efc6b7c85e0 flags=0x0003d01b state=CLO endp=CONN,0x7efc6d095bc0,0x05004a01 sub=0 rex=<NEVER> wex=<NEVER>
       h1s=0x7efc6d095bc0 h1s.flg=0x4010 .sd.flg=0x5004a01 .req.state=MSG_DONE .res.state=MSG_DONE
       .meth=GET status=403 .sd.flg=0x05004a01 .sc.flg=0x0003d01b .sc.app=0x7efc6b7b5400 .subs=(nil)
       h1c=0x7efc6d087280 h1c.flg=0x80000100 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7efc6d052500 .exp=<NEVER>
       co1=0x7efc723ede00 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x7efc71e48000
       flags=0x00040300 fd=239 fd.state=00 updt=0 fd.tmask=0x0
   req=0x7efc6b7b5420 (f=0x848000 an=0x40000 pipe=0 tofwd=0 total=113)
       an_exp=<NEVER> buf=0x7efc6b7b5428 data=(nil) o=0 p=0 i=0 size=0
       htx=0x8c65c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
   res=0x7efc6b7b5470 (f=0xa0240000 an=0x24000000 pipe=2310 tofwd=0 total=25601)
       an_exp=<NEVER> buf=0x7efc6b7b5478 data=(nil) o=0 p=0 i=0 size=0
       htx=0x8c65c0 flags=0x0 size=0 data=0 used=0 wrap=NO extra=0
```

As far as I can tell, not much has changed. The `fd.state` is `00` now (was `1122` before).


Jens


