Hi Willy,

On Sun, May 05, 2019 at 07:07:21AM +0200, Willy Tarreau wrote:
> Hi William,
> > we got a similar issue with last v1.9.7+HEAD
> At first I thought you were again on a deadlock that I couldn't spot, due
> to the fact that nearly all threads were waiting on the LB lock, and I
> couldn't find how this could happen. But I didn't notice this one which
> is the most important :
> 
> > Thread 15 (Thread 0x7fe9b6631700 (LWP 2808)):
> > #0  0x000056153d96d7a0 in __eb_insert_dup (new=0x56157f52f424, 
> > sub=0x56157f5640a4) at ebtree/ebtree.h:478
> > #1  eb_insert_dup (sub=<optimized out>, new=0x56157f52f424) at 
> > ebtree/ebtree.c:31
> > #2  0x000056153d96df10 in __eb32_insert (new=new@entry=0x56157f52f424, 
> > root=<optimized out>, root@entry=0x56157deb4140) at ebtree/eb32tree.h:337
> > #3  eb32_insert (root=root@entry=0x56157deb4140, 
> > new=new@entry=0x56157f52f424) at ebtree/eb32tree.c:27
> > #4  0x000056153d957fcb in fwrr_queue_srv (s=s@entry=0x56157f52f080) at 
> > src/lb_fwrr.c:371
> > #5  0x000056153d9585e8 in fwrr_update_server_weight (srv=0x56157f52f080) at 
> > src/lb_fwrr.c:242
> > #6  0x000056153d8ae8ac in srv_update_status (s=0x56157f52f080) at 
> > src/server.c:4923
> > #7  0x000056153d8adfc2 in server_recalc_eweight 
> > (sv=sv@entry=0x56157f52f080, must_update=must_update@entry=1) at 
> > src/server.c:1310
> > #8  0x000056153d8b6edd in server_warmup (t=0x5615899be8a0, 
> > context=0x56157f52f080, state=<optimized out>) at src/checks.c:1492
> > #9  0x000056153d94d97a in process_runnable_tasks () at src/task.c:390
> > #10 0x000056153d8c5c4f in run_poll_loop () at src/haproxy.c:2661
> > #11 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2726
> > #12 0x00007fe9bd5e7dd5 in start_thread () from /lib64/libpthread.so.0
> > #13 0x00007fe9bc320ead in clone () from /lib64/libc.so.6
> 
> Thus I conclude that it crashed, and that all other threads just met at
> the same lock while the core was being dumped in this one. I figured what
> was missing, the server_warmup() function was missing a lock since 1.8.
> I've just fixed this and backported it to 1.9. I would be grateful if
> you could test it again, as I failed to reproduce the issue (it requires
> a high concurrency and bad luck, as often in such cases).

Thank you very much for the patch. We are pushing it out today. For reference:
http://git.haproxy.org/?p=haproxy-1.9.git;a=commit;h=207ba5a6bc1c03f2ba15ac3cd49bfa756fb760bb
However, I don't know when I will be able to confirm the fix, as the issue
did not show up very often.
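
To illustrate the pattern behind this kind of fix, here is a minimal sketch,
not the actual patch: a periodic warmup step that takes a lock before it
touches shared load-balancing state. The names `struct lb_state`,
`lb_state_init()` and `warmup_step()` are hypothetical, and a plain pthread
mutex stands in for HAProxy's own locking primitives.

```c
#include <pthread.h>

/* Hypothetical stand-in for shared load-balancing state.
 * These are NOT HAProxy's real structures. */
struct lb_state {
    pthread_mutex_t lock;   /* protects every field below */
    unsigned int eweight;   /* effective weight of one server */
    unsigned int updates;   /* number of completed weight updates */
};

static void lb_state_init(struct lb_state *lb)
{
    pthread_mutex_init(&lb->lock, NULL);
    lb->eweight = 0;
    lb->updates = 0;
}

/* Hypothetical warmup step: the weight change and the bookkeeping
 * happen under the lock, so another thread holding the same lock
 * never observes or produces a half-updated state. */
static void warmup_step(struct lb_state *lb, unsigned int new_weight)
{
    pthread_mutex_lock(&lb->lock);
    lb->eweight = new_weight;
    lb->updates++;
    pthread_mutex_unlock(&lb->lock);
}
```

Without the lock/unlock pair, two threads running this step at the same
time could interleave their writes, which is the sort of window that only
shows up under high concurrency.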

> Or maybe the tree got corrupted and __eb_insert_dup() entered an endless
> loop. If that's the case (I mean if it froze and didn't crash), I may
> have something to make this safer soon. I more or less managed to create
> a watchdog timer to detect lockups and abort the whole process with a
> trace when this happens. This will avoid keeping a faulty process in
> prod and may even allow a quicker restart. I don't intend to backport
> it to 1.9 though but depending on how effective and helpful it is, I
> could change my mind. In all cases I don't want to use such solutions
> to hide the dust under the carpet but instead to take detailed traces
> without requiring human intervention when this happens.

I like the idea :)
It would be nice for us to be able to gather that information later!
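
A rough sketch of how such a lockup check can work in principle, under my
own assumptions and not based on the actual watchdog code: each thread bumps
a heartbeat counter on every loop iteration, and a periodic check aborts the
process if the counter has not moved since the previous check. All names
(`struct wd_thread`, `wd_beat()`, `wd_is_stuck()`, `wd_check_or_abort()`) are
hypothetical, and the sketch is single-threaded; a real version would need
atomic accesses and a timer to drive the check.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-thread record; illustrative only. */
struct wd_thread {
    unsigned long heartbeat;  /* bumped by the thread on each loop iteration */
    unsigned long last_seen;  /* heartbeat value observed at the previous check */
};

/* Called by the worker each time it goes around its main loop. */
static void wd_beat(struct wd_thread *t)
{
    t->heartbeat++;
}

/* Called periodically. Returns 1 if the thread made no progress
 * since the previous check, 0 otherwise. */
static int wd_is_stuck(struct wd_thread *t)
{
    int stuck = (t->heartbeat == t->last_seen);

    t->last_seen = t->heartbeat;
    return stuck;
}

/* On detection, report and abort so that a core dump or trace can be
 * collected without anyone having to intervene. */
static void wd_check_or_abort(struct wd_thread *t, int id)
{
    if (wd_is_stuck(t)) {
        fprintf(stderr, "watchdog: thread %d made no progress, aborting\n", id);
        abort();
    }
}
```

The point of aborting rather than trying to recover is the one made above:
the faulty process does not stay in production, and the trace is taken at
the moment of the lockup.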
-- 
William
