Hi Willy,

On Sun, May 05, 2019 at 07:07:21AM +0200, Willy Tarreau wrote:
> Hi William,
>
> > we got a similar issue with last v1.9.7+HEAD
>
> At first I thought you were again on a deadlock that I couldn't spot, due
> to the fact that nearly all threads were waiting on the LB lock, and I
> couldn't find how this could happen. But I didn't notice this one which
> is the most important :
>
> > Thread 15 (Thread 0x7fe9b6631700 (LWP 2808)):
> > #0 0x000056153d96d7a0 in __eb_insert_dup (new=0x56157f52f424,
> > sub=0x56157f5640a4) at ebtree/ebtree.h:478
> > #1 eb_insert_dup (sub=<optimized out>, new=0x56157f52f424) at
> > ebtree/ebtree.c:31
> > #2 0x000056153d96df10 in __eb32_insert (new=new@entry=0x56157f52f424,
> > root=<optimized out>, root@entry=0x56157deb4140) at ebtree/eb32tree.h:337
> > #3 eb32_insert (root=root@entry=0x56157deb4140,
> > new=new@entry=0x56157f52f424) at ebtree/eb32tree.c:27
> > #4 0x000056153d957fcb in fwrr_queue_srv (s=s@entry=0x56157f52f080) at
> > src/lb_fwrr.c:371
> > #5 0x000056153d9585e8 in fwrr_update_server_weight (srv=0x56157f52f080) at
> > src/lb_fwrr.c:242
> > #6 0x000056153d8ae8ac in srv_update_status (s=0x56157f52f080) at
> > src/server.c:4923
> > #7 0x000056153d8adfc2 in server_recalc_eweight
> > (sv=sv@entry=0x56157f52f080, must_update=must_update@entry=1) at
> > src/server.c:1310
> > #8 0x000056153d8b6edd in server_warmup (t=0x5615899be8a0,
> > context=0x56157f52f080, state=<optimized out>) at src/checks.c:1492
> > #9 0x000056153d94d97a in process_runnable_tasks () at src/task.c:390
> > #10 0x000056153d8c5c4f in run_poll_loop () at src/haproxy.c:2661
> > #11 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2726
> > #12 0x00007fe9bd5e7dd5 in start_thread () from /lib64/libpthread.so.0
> > #13 0x00007fe9bc320ead in clone () from /lib64/libc.so.6
>
> Thus I conclude that it crashed, and that all other threads just met at
> the same lock while the core was being dumped in this one. I figured what
> was missing, the server_warmup() function was missing a lock since 1.8.
> I've just fixed this and backported it to 1.9. I would be grateful if
> you could test it again, as I failed to reproduce the issue (it requires
> a high concurrency and bad luck, as often in such cases).
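
For readers following the thread, here is a minimal sketch of the class of bug described above: a periodic task updating shared per-server state without taking the lock that the other code paths rely on. This is not the actual HAProxy patch. Every name here (demo_server, demo_requeue, demo_warmup_step, demo_run) is hypothetical, a plain pthread mutex stands in for HAProxy's own locking primitives, and a counter stands in for the ebtree insertion seen in the backtrace.

```c
#include <pthread.h>
#include <stdint.h>

#define DEMO_MAX_THREADS 16

/* Hypothetical stand-in for a server whose weight is shared by all threads. */
struct demo_server {
    pthread_mutex_t lock;   /* protects eweight and updates */
    uint32_t eweight;       /* effective weight, rewritten on each warmup step */
    uint32_t updates;       /* number of updates applied so far */
};

struct demo_job {
    struct demo_server *srv;
    uint32_t iterations;
};

/* Stand-in for the weight recalculation and requeue; caller must hold the lock. */
static void demo_requeue(struct demo_server *s, uint32_t new_weight)
{
    s->eweight = new_weight;
    s->updates++;
}

/* The warmup step takes the server lock around the shared update.
 * Dropping the lock/unlock pair reintroduces the race. */
void demo_warmup_step(struct demo_server *s, uint32_t new_weight)
{
    pthread_mutex_lock(&s->lock);
    demo_requeue(s, new_weight);
    pthread_mutex_unlock(&s->lock);
}

static void *demo_worker(void *arg)
{
    struct demo_job *job = arg;

    for (uint32_t i = 0; i < job->iterations; i++)
        demo_warmup_step(job->srv, i);
    return NULL;
}

/* Runs up to DEMO_MAX_THREADS workers against one server and returns the
 * number of updates recorded once all of them have finished. */
uint32_t demo_run(unsigned nthreads, uint32_t iterations)
{
    struct demo_server srv = { .eweight = 0, .updates = 0 };
    struct demo_job job = { .srv = &srv, .iterations = iterations };
    pthread_t tids[DEMO_MAX_THREADS];
    unsigned started = 0;

    if (nthreads > DEMO_MAX_THREADS)
        nthreads = DEMO_MAX_THREADS;

    pthread_mutex_init(&srv.lock, NULL);
    for (unsigned i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, demo_worker, &job) != 0)
            break;
        started++;
    }
    for (unsigned i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&srv.lock);

    return srv.updates;
}
```

With the lock held, demo_run(4, 10000) records every update. Without it, concurrent writers can lose updates, and with a real tree instead of a counter they can corrupt node links, which fits the remark that the issue needs high concurrency and bad luck to show up.
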

Thank you very much for the patch. We are pushing it today.
http://git.haproxy.org/?p=haproxy-1.9.git;a=commit;h=207ba5a6bc1c03f2ba15ac3cd49bfa756fb760bb
for reference.

However, I don't know when I will be able to confirm, as the issue was not
showing up that often.

> Or maybe the tree got corrupted and __eb_insert_dup() entered an endless
> loop. If that's the case (I mean if it froze and didn't crash), I may
> have something to make this safer soon. I more or less managed to create
> a watchdog timer to detect lockups and abort the whole process with a
> trace when this happens. This will avoid keeping a faulty process in
> prod and may even allow a quicker restart. I don't intend to backport
> it to 1.9 though but depending on how effective and helpful it is, I
> could change my mind. In all cases I don't want to use such solutions
> to hide the dust under the carpet but instead to take detailed traces
> without requiring human intervention when this happens.

I like the idea :)
It would be nice for us to gather that info later!

--
William

