Hi Christopher,

Christopher Faulet wrote:
On 29/07/2024 at 16:30, Jens Wahnes wrote:
Christopher Faulet wrote:
On 29/07/2024 at 09:05, Christopher Faulet wrote:

Thanks, I will investigate. It is indeed most probably an issue with splicing, as Willy said. I will try to find the bug on 2.8 and figure out whether later versions are affected too.

I'm able to reproduce the issue by hacking the code to force a connection error by hand. It occurs when an error is reported on the connection while haproxy is trying to send data using kernel splicing. But it is only an issue when a filter is attached to the applicative stream. I guess you have HTTP compression enabled. The response is not compressed, of course, otherwise kernel splicing would not be used. But the filter is still attached to the stream, and it has an effect in this case.
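For illustration, a minimal configuration in which this combination applies might look like the following; all names, addresses, paths and timeouts here are placeholders, not taken from your setup:

```
# Hypothetical example: kernel splicing enabled together with HTTP
# compression. Configuring compression attaches a filter to every HTTP
# stream, even when a particular response ends up not being compressed.
defaults
    mode http
    option splice-response        # allow kernel splicing for server responses
    option splice-auto            # let haproxy decide when splicing is worthwhile
    timeout connect 5s
    timeout client  1m
    timeout server  10m

frontend fe_https
    bind :443 ssl crt /etc/haproxy/site.pem   # placeholder certificate path
    compression algo gzip                     # this is what attaches the filter
    compression type text/html text/plain
    default_backend be_app

backend be_app
    server app1 192.0.2.10:8080               # placeholder server address
```

With a setup like this, a response that is not eligible for compression can still be forwarded via kernel splicing, yet the stream nevertheless carries the compression filter, which is the combination that triggers the bug.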

AFAIK, the older versions are not affected. For newer versions, I don't really know: there is an issue with my hack, but timeouts are still active and a true client abort is properly detected. So I'm inclined to think there is no issue on these versions. But my fix will probably be applicable to them too.

I'm working on the fix. I still need to test what happens when the error occurs on the server side, to be sure. But it should be fixed soon.


Thank you for the update.

My results so far: Everything is fine on 2.8.10 without splicing.

On 3.0.3 with splicing turned on, I have also not seen any lingering sessions, but I have only been running version 3.0.3 for a few hours now, so this could still happen. I'd rather let it run for some more time before drawing conclusions.


Thanks for the confirmation. On 3.0, I was unable to reproduce the issue. So I'm not surprised.

On version 3.0.3 with splicing turned on, I actually did end up with a backend connection in state CLOSE_WAIT that is still around after some hours. But it is different from the other cases I saw with version 2.8.10: in this case, the frontend connection is HTTPS (likely using HTTP/2), while the backend connection is HTTP/1. What is also a bit special is that the HTTP response code is 204, so there is no "real" data being transmitted.

So I'm not sure whether this one is really related to the other ones, or whether it's just a coincidence that I'm noticing it while looking for other connections that are not being closed properly. :)

The associated `show sess` output looks like this (IP addresses slightly altered):

```
0x7fc63caccc00: [30/Jul/2024:14:05:50.626096] id=123594 proto=tcpv4 source=10.80.119.118:53864 flags=0x3384a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000 srv_conn=0x7fc63f0a6000, pend_pos=(nil) waiting=0 epoch=0 frontend=Loadbalancer (id=48 mode=http), listener=https (id=15) addr=10.210.18.56:443
  backend=projekt_piwik_2019 (id=94 mode=http) addr=172.16.240.53:50858
  server=counterstrike (id=1) addr=172.16.240.99:2501
task=0x7fc63eb6f260 (state=0x00 nice=400 calls=6 rate=0 exp=<NEVER> tid=1(1/1) age=2h20m) txn=0x7fc63ed00320 flags=0x40000 meth=3 status=-1 req.st=MSG_DONE rsp.st=MSG_RPBEFORE req.f=0x4d rsp.f=0x00 scf=0x7fc642deb620 flags=0x00070006 ioto=1m state=CLO endp=CONN,0x7fc63ebbdf00,0x5043d601 sub=0 rex=<NEVER> wex=<NEVER> rto=? wto=<NEVER>
    iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
      h2s=0x7fc63ebbdf00 h2s.id=15 .st=CLO .flg=0x4109 .rxbuf=0@(nil)+0/0
.sc=0x7fc642deb620(.flg=0x00070006 .app=0x7fc63caccc00) .sd=0x7fc642d25770(.flg=0x5043d601)
       .subs=(nil)
h2c=0x7fc642d8f200 h2c.st0=FRH .err=0 .maxid=15 .lastid=-1 .flg=0x1a60600 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=15 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=0x7fc642c0af60 .exp=<NEVER> co0=0x7fc63cb2d320 ctrl=tcpv4 xprt=SSL mux=H2 data=STRM target=LISTENER:0x7fc641a5a400
      flags=0x801c0300 fd=237 fd.state=1922 updt=0 fd.tmask=0x2
scb=0x7fc642da50e0 flags=0x00001013 ioto=10m state=EST endp=CONN,0x7fc63ebbb200,0x50404001 sub=1 rex=<NEVER> wex=<NEVER> rto=? wto=<NEVER>
    iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
h1s=0x7fc63ebbb200 h1s.flg=0x94010 .sd.flg=0x50404001 .req.state=MSG_DONE .res.state=MSG_DONE .meth=POST status=204 .sd.flg=0x50404001 .sc.flg=0x00001013 .sc.app=0x7fc63caccc00 .subs=0x7fc642da50f8(ev=1 tl=0x7fc642da9c40 tl.calls=4 tl.ctx=0x7fc642da50e0 tl.fct=sc_conn_io_cb) h1c=0x7fc63ca5d840 h1c.flg=0x80000000 .sub=0 .ibuf=0@(nil)+0/0 .obuf=0@(nil)+0/0 .task=0x7fc63cad2940 .exp=<NEVER> co1=0x7fc63ebb6820 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM target=SERVER:0x7fc63f0a6000
      flags=0x00000300 fd=167 fd.state=11122 updt=0 fd.tmask=0x2
  filters={0x7fc642d6fb30="bandwidth limitation filter"}
  req=0x7fc63caccc28 (f=0x20840000 an=0x48000 tofwd=0 total=993)
an_exp=<NEVER> buf=0x7fc63caccc30 data=0x7fc63cb52280 o=0 p=0 i=16384 size=16384 htx=0x7fc63cb52280 flags=0x10 size=16336 data=1 used=1 wrap=NO extra=0
  res=0x7fc63caccc70 (f=0x80008000 an=0x20000000 tofwd=0 total=274)
an_exp=<NEVER> buf=0x7fc63caccc78 data=0x7fc63c6a7dc0 o=274 p=274 i=16110 size=16384 htx=0x7fc63c6a7dc0 flags=0x10 size=16336 data=274 used=11 wrap=NO extra=0

```

I pushed some patches that should fix your issue. They cannot be applied as-is on 2.8, but you can use the attached patches for 2.8 if you want to give them a try. That would help to confirm that they properly fix your issue.

Thank you. I applied the patches and compiled a version of 2.8.10 based on them. However, I'm hesitant to run it just yet, as I am uncertain whether the above example of a "stuck" session on version 3.0.3 is worth another look first. If you would like me to perform any action on that session to diagnose it further, like the "close FD" or anything else, please let me know.
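For example, I could run something like this against the stats socket; the socket path below is just my local one, and the pointer and fd numbers are taken from the dump above:

```
# Dump that single stream again, using the pointer from the output above
echo "show sess 0x7fc63caccc00" | socat stdio /var/run/haproxy/admin.sock

# Inspect the two file descriptors involved (frontend fd=237, backend fd=167)
echo "show fd 237" | socat stdio /var/run/haproxy/admin.sock
echo "show fd 167" | socat stdio /var/run/haproxy/admin.sock

# Or, if it is of no further use to you, kill the stream by its pointer
echo "shutdown session 0x7fc63caccc00" | socat stdio /var/run/haproxy/admin.sock
```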

Otherwise, I'd try the version of 2.8.10 with your patches applied.
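For reference, I prepared that build roughly as follows; the patch directory is simply where I saved your attachments, and the build options are my usual ones rather than anything specific to the fix:

```
# Apply the backported patches on top of the vanilla 2.8.10 sources
cd haproxy-2.8.10
for p in ../2.8-backport/*.patch; do patch -p1 < "$p"; done

# Build with my usual options (Linux + OpenSSL)
make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 USE_PCRE2=1
```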


Jens


