Le 28/11/2017 à 07:25, Willy Tarreau a écrit :
Hi Pieter,
On Mon, Nov 27, 2017 at 09:43:52PM +0100, PiBa-NL wrote:
Hi List,
I thought i 'reasonably' tested some of 1.8.0's options.
Today i put it into 'production' on my secondary cluster node and notice it
takes 100% cpu...
Grrrr.. bad. This sounds like another case of recursive locking.
I guess i should have tried such a thing last week.
Don't worry, whatever the amount of tests you run, some bugs will always
slip through.
Anyhow below some gdb and console output.
Very useful, I found it :
process_chk_conn() takes the lock then calls connect_conn_chk() :
2114 HA_SPIN_LOCK(SERVER_LOCK, &check->server->lock);
2137 ret = connect_conn_chk(t);
connect_conn_chk() then calls tcpcheck_main() :
1548 tcpcheck_main(check);
And this one takes the lock again :
2598 HA_SPIN_LOCK(SERVER_LOCK, &check->server->lock);
CCing Emeric as he's the one who covered the checks so he will know best
how to fix it.
In the mean time, if you don't need threads you can rebuild with "USE_THREAD="
to disable them, but I'd rather wait for a fix. Sorry about that, and thaks
for the report.
Hi Pieter,
Here is a patch that should fix the deadlock. Could you confirm it fixes
your bug ?
Emeric, this patch should be good, but take a look on it, just to be sure.
Thanks
--
--
Christopher Faulet
>From 5fd4083becd141080ec8cf0923b222e0ae6119af Mon Sep 17 00:00:00 2001
From: Christopher Faulet <[email protected]>
Date: Tue, 28 Nov 2017 10:06:29 +0100
Subject: [PATCH] BUG/MEDIUM: tcp-check: Don't lock the server in tcpcheck_main
There was a deadlock in tcpcheck_main function. The server's lock was already
acquired by the caller (process_chk_conn or wake_srv_chk).
This patch must be backported in 1.8.
---
src/checks.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/src/checks.c b/src/checks.c
index ad25ac094..63747201e 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -2595,8 +2595,6 @@ static int tcpcheck_main(struct check *check)
struct list *head = check->tcpcheck_rules;
int retcode = 0;
- HA_SPIN_LOCK(SERVER_LOCK, &check->server->lock);
-
/* here, we know that the check is complete or that it failed */
if (check->result != CHK_RES_UNKNOWN)
goto out_end_tcpcheck;
@@ -2637,7 +2635,7 @@ static int tcpcheck_main(struct check *check)
if (s->proxy->timeout.check)
t->expire = tick_first(t->expire, t_con);
}
- goto out_unlock;
+ goto out;
}
/* special case: option tcp-check with no rule, a connect is enough */
@@ -2732,7 +2730,7 @@ static int tcpcheck_main(struct check *check)
chunk_appendf(&trash, " comment: '%s'", comment);
set_server_check_status(check, HCHK_STATUS_SOCKERR, trash.str);
check->current_step = NULL;
- goto out_unlock;
+ goto out;
}
if (check->cs)
@@ -2854,7 +2852,7 @@ static int tcpcheck_main(struct check *check)
if (s->proxy->timeout.check)
t->expire = tick_first(t->expire, t_con);
}
- goto out_unlock;
+ goto out;
}
} /* end 'connect' */
@@ -3059,7 +3057,7 @@ static int tcpcheck_main(struct check *check)
if (&check->current_step->list != head &&
check->current_step->action == TCPCHK_ACT_EXPECT)
__cs_want_recv(cs);
- goto out_unlock;
+ goto out;
out_end_tcpcheck:
/* collect possible new errors */
@@ -3074,8 +3072,7 @@ static int tcpcheck_main(struct check *check)
__cs_stop_both(cs);
- out_unlock:
- HA_SPIN_UNLOCK(SERVER_LOCK, &check->server->lock);
+ out:
return retcode;
}
--
2.13.6