We still have a problem with PR #10971 here, running a -STABLE as of last
week. (10971 should have been dead long since.) It is a difficult problem
to track down, as stack corruption makes the debugging files less than useless.
I do, however, have a ktrace of an entire transaction that causes ypserv
to die. I am in the process of trying to track down why it is dying; it
appears to be a bug in the rpc library itself. Normally what happens is
the following:
# TCP request comes in, accept().
# yp_all request issued, parent forks.
# child handles request, quits.
# parent is interrupted in its select() call, dispatches to signal handler for SIGCHLD
# handler returns.
# parent issues a read?!? (this is odd, since it doesn't re-enter the select loop as the code I have read suggests it should).
# read fails (0 bytes returned)
# it does that a couple of times (probably falling out of loops), and FD is closed
# ypserv re-enters the select loop
Under a failure condition the following happens:
# Upon child return parent reads from a DB file to a nonexistent buffer.
# parent seg-faults.
I believe the problem code is "next to" the section of the code where it
select()s, and then accept()s if it is a TCP connection... but I cannot find
where this code is. A grep for 'accept' in both the ypserv and rpc code
returns no useful matches. Also, it would certainly appear that there
is another select loop besides the one in the canonical ypserv.
Below are the dying moments of the parent process as reported by ktrace.
Ideas?
41096 ypserv CALL fork
41096 ypserv RET fork 62356/0xf394
41096 ypserv CALL gettimeofday(0xbfbff510,0)
41096 ypserv RET gettimeofday 0
41096 ypserv CALL select(0x10,0x8051040,0,0,0xbfbff518)
41096 ypserv PSIG SIGCHLD caught handler=0x804c75c mask=0x0 code=0x0
41096 ypserv RET select -1 errno 4 Interrupted system call
41096 ypserv CALL wait4(0xffffffff,0xbfbff308,0x1,0)
41096 ypserv RET wait4 62356/0xf394
41096 ypserv CALL wait4(0xffffffff,0xbfbff308,0x1,0)
41096 ypserv RET wait4 -1 errno 10 No child processes
41096 ypserv CALL sigreturn(0xbfbff328)
41096 ypserv RET sigreturn JUSTRETURN
41096 ypserv CALL gettimeofday(0xbfbff510,0)
41096 ypserv RET gettimeofday 0
41096 ypserv CALL read(0x1c,0x80f3fa0,0xfa0)
41096 ypserv GIO fd 28 read 4000 bytes
41096 ypserv RET read 4000/0xfa0
41096 ypserv PSIG SIGSEGV SIG_DFL
41096 ypserv NAMI "ypserv.core"
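
The two wait4 calls in the trace look like an ordinary non-blocking reap loop: 0xffffffff is a pid of -1 (any child) and 0x1 is WNOHANG, repeated until the call fails with ECHILD. A minimal sketch of that pattern follows. It is an assumption about the shape of the handler, not the actual ypserv code, and the names reap_children and spawn_and_reap are made up for illustration. It uses waitpid(), which on FreeBSD shows up in ktrace as wait4.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child that has already exited, without blocking.
 * Returns the number of children collected. errno is saved and
 * restored so the interrupted code does not see ECHILD.
 */
int reap_children(void)
{
    int saved_errno = errno;
    int status;
    int reaped = 0;

    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;

    errno = saved_errno;
    return reaped;
}

/*
 * Hypothetical driver: fork n children that exit at once, then poll
 * reap_children() until all n have been collected.
 */
int spawn_and_reap(int n)
{
    int i;
    int total = 0;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0)
            _exit(0);
    }

    while (total < n) {
        total += reap_children();
        if (total < n)
            usleep(1000);        /* give the children time to exit */
    }
    return total;
}
```

Nothing in this pattern touches the stack beyond a single int, which is consistent with the handler returning cleanly through sigreturn in the trace before the read that precedes the SIGSEGV.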
Oh, this is true of all systems, not just 4.0-STABLE. I was hoping the move
to 4.0 might solve the problem, so I wasn't actively trying to debug it before.
--
David Cross | email: [EMAIL PROTECTED]
Lab Director | Rm: 308 Lally Hall
Rensselaer Polytechnic Institute, | Ph: 518.276.2860
Department of Computer Science | Fax: 518.276.4033
I speak only for myself. | WinNT:Linux::Linux:FreeBSD