On 30/07/2024 at 17:23, Jens Wahnes wrote:
Thanks for the confirmation. On 3.0, I was unable to reproduce the
issue. So I'm not surprised.
On version 3.0.3 with splicing turned on, I actually did end up with a
backend connection in state CLOSE_WAIT that is still around after some
hours. But it is different from the other cases I saw (with version
2.8.10). This one on 3.0.3 is an HTTPS connection (likely using HTTP/2)
on the frontend side and the backend connection is HTTP/1. What is also
a bit special is that the HTTP response code is 204, so there is no
"real" data being transmitted.
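For anyone wanting to spot such lingering sockets from the kernel side, here is a minimal sketch in Python. It assumes a Linux host, where `/proc/net/tcp` lists IPv4 sockets with a hexadecimal state column and `08` means CLOSE_WAIT. It also assumes a little-endian machine for decoding the addresses. The function names are purely illustrative and are not part of HAProxy.

```python
import socket
import struct

TCP_CLOSE_WAIT = 0x08  # state code the Linux kernel prints in /proc/net/tcp


def decode_endpoint(hex_endpoint):
    """Turn 'AABBCCDD:PPPP' into ('a.b.c.d', port); assumes a little-endian host."""
    hex_ip, hex_port = hex_endpoint.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return (local, remote) endpoint pairs for sockets in CLOSE_WAIT."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        if int(fields[3], 16) == TCP_CLOSE_WAIT:
            found.append((decode_endpoint(fields[1]), decode_endpoint(fields[2])))
    return found


def read_close_wait(path="/proc/net/tcp"):
    """Convenience wrapper: read the kernel table and filter it (IPv4 only)."""
    with open(path) as f:
        return close_wait_sockets(f.read())
```

Calling `read_close_wait()` on the load balancer and matching the remote endpoint against the server address would show which backend sockets are stuck. It only reports the current state, not how long a socket has been in it, so the age still has to come from `show sess`.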
That's interesting because here, there is no kernel splicing. So my fixes will
not catch this case. But you are using the bandwidth limitation filter. It may
be related. I must investigate a bit. It seems unrelated to the previous issue,
however, because when splicing is used, no filter can be active.
Could you share your bwlim configuration, please?
So I'm not sure if this one is really related to the other ones or if
it's just a coincidence that I'm seeing this happen as I'm looking for
other connections not being closed properly. :)
The associated `show sess` looks like this (IP addresses slightly altered):
```
0x7fc63caccc00: [30/Jul/2024:14:05:50.626096] id=123594 proto=tcpv4
source=10.80.119.118:53864
flags=0x3384a, conn_retries=0, conn_exp=<NEVER> conn_et=0x000
srv_conn=0x7fc63f0a6000, pend_pos=(nil) waiting=0 epoch=0
frontend=Loadbalancer (id=48 mode=http), listener=https (id=15)
addr=10.210.18.56:443
backend=projekt_piwik_2019 (id=94 mode=http) addr=172.16.240.53:50858
server=counterstrike (id=1) addr=172.16.240.99:2501
task=0x7fc63eb6f260 (state=0x00 nice=400 calls=6 rate=0 exp=<NEVER>
tid=1(1/1) age=2h20m)
txn=0x7fc63ed00320 flags=0x40000 meth=3 status=-1 req.st=MSG_DONE
rsp.st=MSG_RPBEFORE req.f=0x4d rsp.f=0x00
scf=0x7fc642deb620 flags=0x00070006 ioto=1m state=CLO
endp=CONN,0x7fc63ebbdf00,0x5043d601 sub=0 rex=<NEVER> wex=<NEVER> rto=?
wto=<NEVER>
iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
h2s=0x7fc63ebbdf00 h2s.id=15 .st=CLO .flg=0x4109 .rxbuf=0@(nil)+0/0
.sc=0x7fc642deb620(.flg=0x00070006 .app=0x7fc63caccc00)
.sd=0x7fc642d25770(.flg=0x5043d601)
.subs=(nil)
h2c=0x7fc642d8f200 h2c.st0=FRH .err=0 .maxid=15 .lastid=-1
.flg=0x1a60600 .nbst=0 .nbsc=1, .glitches=0
.fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=0 .dsi=15
.dbuf=0@(nil)+0/0
.mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0]
.task=0x7fc642c0af60 .exp=<NEVER>
co0=0x7fc63cb2d320 ctrl=tcpv4 xprt=SSL mux=H2 data=STRM
target=LISTENER:0x7fc641a5a400
flags=0x801c0300 fd=237 fd.state=1922 updt=0 fd.tmask=0x2
scb=0x7fc642da50e0 flags=0x00001013 ioto=10m state=EST
endp=CONN,0x7fc63ebbb200,0x50404001 sub=1 rex=<NEVER> wex=<NEVER> rto=?
wto=<NEVER>
iobuf.flags=0x00000000 .pipe=0 .buf=0@(nil)+0/0
h1s=0x7fc63ebbb200 h1s.flg=0x94010 .sd.flg=0x50404001
.req.state=MSG_DONE .res.state=MSG_DONE
.meth=POST status=204 .sd.flg=0x50404001 .sc.flg=0x00001013
.sc.app=0x7fc63caccc00
.subs=0x7fc642da50f8(ev=1 tl=0x7fc642da9c40 tl.calls=4
tl.ctx=0x7fc642da50e0 tl.fct=sc_conn_io_cb)
h1c=0x7fc63ca5d840 h1c.flg=0x80000000 .sub=0 .ibuf=0@(nil)+0/0
.obuf=0@(nil)+0/0 .task=0x7fc63cad2940 .exp=<NEVER>
co1=0x7fc63ebb6820 ctrl=tcpv4 xprt=RAW mux=H1 data=STRM
target=SERVER:0x7fc63f0a6000
flags=0x00000300 fd=167 fd.state=11122 updt=0 fd.tmask=0x2
filters={0x7fc642d6fb30="bandwidth limitation filter"}
req=0x7fc63caccc28 (f=0x20840000 an=0x48000 tofwd=0 total=993)
an_exp=<NEVER> buf=0x7fc63caccc30 data=0x7fc63cb52280 o=0 p=0
i=16384 size=16384
htx=0x7fc63cb52280 flags=0x10 size=16336 data=1 used=1 wrap=NO
extra=0
res=0x7fc63caccc70 (f=0x80008000 an=0x20000000 tofwd=0 total=274)
an_exp=<NEVER> buf=0x7fc63caccc78 data=0x7fc63c6a7dc0 o=274 p=274
i=16110 size=16384
htx=0x7fc63c6a7dc0 flags=0x10 size=16336 data=274 used=11 wrap=NO
extra=0
```
I pushed some patches that should fix your issue. They cannot be applied
as-is to 2.8. You can use the attached patches for 2.8 if you want
to try. That would help confirm that they properly fix your issue.
Thank you. I applied the patches and compiled a version of 2.8.10 based
on this. However, I'm hesitating to run it just yet, as I am uncertain
if the above example of a "stuck" session in version 3.0.3 is worth
another look. If you would like me to perform any action on that session
to diagnose further, like the "close FD" or anything else, please let me
know.
Otherwise, I'd try the version of 2.8.10 with your patches applied.
Well, I'm annoyed. At first glance the two issues are unrelated, but the symptoms
are too similar to be a coincidence. So I may have missed something the first time,
distracted by the splicing. For now, it is better to wait a bit before
testing my fixes. I hope to find the root cause of the issue quickly.
Thanks!
--
Christopher Faulet