On Fri, 11 Dec 2020 at 15:21, Christopher Faulet <[email protected]> wrote:
>
> On 11/12/2020 at 11:45, Christopher Faulet wrote:
> > On 10/12/2020 at 19:38, Peter Statham wrote:
> >> > Sorry for the delay in getting back to you. It is the same crash.
> >> > Now that we know the issue isn't universal, we've been trying to
> >> > narrow down the exact combination of compiler, libraries, kernel,
> >> > hypervisor, etc. that causes it, but that's turning out to be
> >> > trickier than identifying the issue.
> >> >
> >> > I only backported the changes to the src/lb_fwlc.c file, but
> >> > backporting 1b87748ff5 seems to work just as well. So far we haven't
> >> > been able to provoke the issue with the changes in 1b87748ff5 applied
> >> > to the 1.8 tree, so that does look like a solution.
> >> >
> >> > We will keep testing and trying to narrow the issue down.
> >>
> >> Since I wrote the above I have managed to replicate the issue on 1.8
> >> with it applied, so it looks as if that was not the solution after all.
> >>
> >> I include a binary built from 1.8.27 with 1b87748ff5 backported and a
> >> core dump.
> >>
> >> haproxy-1.8.27+1b87748ff5
> >> <https://drive.google.com/file/d/1KPs3rBpkeqE9GEOfjF8Ocycd1wa4RjqW/view?usp=drive_web>
> >> haproxy-1.8.27+1b87748ff5.core
> >> <https://drive.google.com/file/d/1chBPoogHBuGlnV1o5sO9YP6BldpRH4d3/view?usp=drive_web>
> >
> > Thanks Peter, I'll try to take a look today. Is the reproducer the same?
>
> OK, in fact it is pretty easy to reproduce. Because I found a similar bug
> on newer versions, I had not tested on 1.8. Unfortunately, there is a
> second bug, specific to 1.8.
>
> I attached a patch that should fix it. The bug exists because of the
> rendez-vous point, which was removed in newer versions. On 1.8, there may
> be a short delay before server state changes are committed, because we
> must wait for all threads. Thus, we must take care not to use info from
> the next state too early. And this is the bug here. In the leastconn
> algorithm, the next server weight is used, instead of the current one, to
> reposition the server in the tree. The next server weight must only be
> used once the server state changes are committed.
>
> Peter, could you confirm it fixes your bug?
> --
> Christopher Faulet
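
For anyone following the thread, the distinction Christopher describes can be sketched roughly as below. This is a toy model, not the attached patch and not haproxy's code: the struct, the field names (cur_weight, next_weight), the scale constant and both functions are hypothetical, chosen only to show a sort key being derived from the committed weight while a pending weight waits for its commit step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scale factor; a stand-in, not a constant taken from haproxy. */
#define TOY_WEIGHT_SCALE 256u

/* Toy server record (assumed layout, for illustration only). */
struct toy_server {
    uint32_t served;      /* connections currently being served */
    uint32_t cur_weight;  /* committed weight: what the tree is ordered by */
    uint32_t next_weight; /* pending weight: not valid for ordering yet */
};

/*
 * Key used to position a server in the ordering structure: fewer
 * connections per unit of weight sorts first. It reads only the
 * committed weight, never the pending one.
 */
static uint32_t toy_sort_key(const struct toy_server *s)
{
    assert(s->cur_weight > 0); /* this sketch only handles servers with a positive weight */
    return s->served * TOY_WEIGHT_SCALE / s->cur_weight;
}

/*
 * Commit step: only after this runs does the pending weight become the
 * one that repositioning is allowed to use.
 */
static void toy_commit(struct toy_server *s)
{
    s->cur_weight = s->next_weight;
}
```

In this model, a server with served = 4, cur_weight = 2 and next_weight = 8 keeps being positioned by the weight of 2 until toy_commit() has run; only then does the key reflect the weight of 8.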

The patch seems to fix the issue.

I've built a new version of haproxy 1.8.27 with the patch applied on both
Debian and CentOS under VMware. I then ran these builds concurrently with
my previous builds on both platforms, using configuration files that are
identical save for the bind address. I can reproduce the bug with the
existing build but not with the one that has your patch applied.

I'll ask some of my colleagues to double-check my tests.

--
Peter Statham
Loadbalancer.org Ltd.

