On Mon, Jan 22, 2018 at 06:16:12PM +0100, Willy Tarreau wrote:
> On Mon, Jan 22, 2018 at 05:47:55PM +0100, Willy Tarreau wrote:
> > > strace: Process 12166 attached
> > > [pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
> > > [pid 12166] <... set_robust_list resumed> ) = 0
> > > [pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
> > > [pid 12166] mmap(NULL, 134217728, PROT_NONE, 
> > > MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
> > > [pid 12166] <... mmap resumed> )        = 0x7ff9ac000000
> > > [pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
> > > [pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE 
> > > <unfinished ...>
> > > [pid 12166] <... mprotect resumed> )    = 0
> > > [pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, 
> > > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > > [pid 12166] <... mmap resumed> )        = 0x7ff9baa65000
> > > [pid 12166] close(16 <unfinished ...>
> > > [pid 12166] <... close resumed> )       = 0
> > > [pid 12166] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
> > > [pid 12166] <... fcntl resumed> )       = 0
> > 
> > Here it's getting obvious that it was a shared file descriptor :-(
> 
> So I have a suspect here :
> 
>    - run_thread_poll_loop() runs after the threads are created
>    - first thing it does is to close the master-worker pipe FD :
> 
>         (...)
>       if (global.mode & MODE_MWORKER)
>               mworker_pipe_register(mworker_pipe);
>         (...)
> 
>      void mworker_pipe_register(int pipefd[2])
>      {
>       close(mworker_pipe[1]); /* close the write end of the master pipe in 
> the children */
>       fcntl(mworker_pipe[0], F_SETFL, O_NONBLOCK);
>         (...)
>      }
> 
>      Looks familiar with the trace above ?
> 
> So I guess your config works in master-worker mode, am I right ?

Exactly, I'm running HAProxy with the "-W -db" flags.

> Note that I'm bothered with the call to protocol_enable_all() as
> well in this function since it will start the proxies multiple times
> in a possibly unsafe mode. That may explain a lot of things suddenly!
> 
> I think the attached patch works around it, but I'd like your
> confirmation before cleaning it up.

I applied this single patch on top of 1.8.3, and indeed this seems much
better ! The servers are eventually in an UP state after reloading.

"grep -c EBADF" on yesterday's strace logfile returned around 900 matches.
18 matches with the patchset from your other email, and only 1 with this
patch:

write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)


There still are some "Socket error" in the logs though, and I noticed some
servers (not all) go DOWN for a couple of seconds just after reload because
of this, before coming up again 4 seconds later (I use "inter 2000 rise 2").
But at least the system recovers properly from this situation and seems to
stay stable afterwards.

I can't see any obvious close() in the strace log which would be causing
trouble (un)fortunately. I'll send the whole log to you privately, so you
can have a look.


26356 11:50:48.780107 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, 
sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:48 ", 21}, {"haproxy", 
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"toto_blue/toto_blue0 changed 
its IP from  to 10.10.4.45 by haproxy-dns/dns-server.", 82}, {"\n", 1}], 
msg_controllen=0, msg_
flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 127
26356 11:50:48.780244 getpid()          = 10
26356 11:50:48.780307 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, 
sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:48 ", 21}, {"haproxy", 
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 
administratively READY thanks to valid DNS answer.", 78}, {"\n", 1}], 
msg_controllen=0, msg_flag
s=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 123
26356 11:50:48.780626 getpid()          = 10

[...]

26358 11:50:48.818887 recvfrom(16, 
"L\324\201\203\0\1\0\0\0\1\0\1\tprivate-1\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0\27\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34
 \0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.818981 recvfrom(16, 
"?M\201\203\0\1\0\0\0\1\0\1\tprivate-2\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0008\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34
 \0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.819064 recvfrom(16, 
"Ls\201\203\0\1\0\0\0\1\0\1\tprivate-3\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0\27\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34
 \0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.819146 recvfrom(16, 0x7f05f1433c10, 513, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable)
26356 11:50:48.820330 getpid()          = 10
26356 11:50:48.820410 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, 
sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 23 10:50:48 ", 21}, {"haproxy", 
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 is 
DOWN, reason: Socket error, check duration: 0ms. 31 active and 1 backup servers 
left. 0 sessions active, 0 requeued, 0 remaining in queue.", 169}, {"\n", 1}], 
msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
26356 11:50:48.820591 getpid()          = 10
26356 11:50:48.820662 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, 
sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 23 10:50:48 ", 21}, {"haproxy", 
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue1 is 
DOWN, reason: Socket error, check duration: 0ms. 30 active and 1 backup servers 
left. 0 sessions active, 0 requeued, 0 remaining in queue.", 169}, {"\n", 1}], 
msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
26356 11:50:48.820802 getpid()          = 10

[...]

26357 11:50:52.704807 sendmsg(19, {msg_name(110)={sa_family=AF_LOCAL, 
sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:52 ", 21}, {"haproxy", 
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 is 
UP, reason: Layer7 check passed, code: 200, info: \"OK\", check duration: 16ms. 
1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.", 
185}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 
230


Thanks again for your help !

Marc

Reply via email to