On 12/11/2011 21:41, Guillem Jover wrote:
On Sat, 2011-11-12 at 20:41:47 +0000, Martin Townsend wrote:
I have looked through the code some more and see that what I am
trying to do is wrong and that you need to find the pid with check.
The first pass of the schedule will send the SIGTERM. Then during
the schedule for timeout do_stop is called with a signal of 0 so
that the call to kill will check for the existence of the pid that
was retrieved from the pidfile.
Exactly.
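A minimal sketch of that existence check, written in Python rather than the C that start-stop-daemon uses. The helper name pid_exists is hypothetical; only the kill-with-signal-0 idea comes from the discussion above.

```python
import os


def pid_exists(pid):
    """Probe a pid with signal 0: nothing is delivered, only the checks run."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no process with this pid.
        return False
    except PermissionError:
        # EPERM: the process exists but we may not signal it.
        return True
    return True
```

Note that a process which has exited but has not yet been reaped by its parent still answers this probe, so a successful check only says that the pid is in use.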
This must be where it is failing for me: kill must be returning 0
even though the pidfile has gone and ps -ef is showing that sshd
has a new process id.
do_pidfile() should be caching the pid read initially. Something that
comes to mind is that if the match options are not specific enough,
let's say only --pidfile was used, there's the hypothetical problem of
pid reuse, but openssh-server uses --exec too, and I doubt in your
case the kernel has reused the sshd pid in that short time.
I'm running dpkg version 1.14.31 but I'll try the latest code and
step through it with gdb to try and shed some light on what's going
on.
Starting with strace might give some fast clue, w/o the need of a full
gdb session.
Could you point me to the git master and I'll check it out on Monday?
$ git clone git://git.debian.org/git/dpkg/dpkg.git
thanks,
guillem
Hi,
Running strace confirmed that kill was returning 0.
0.020789 gettimeofday({1321268190, 723734}, NULL) = 0 <0.000078>
0.000712 kill(2865, SIG_0) = 0 <0.000082>
0.000353 gettimeofday({1321268190, 724799}, NULL) = 0 <0.000078>
0.000379 select(0, NULL, NULL, NULL, {0, 20000}) = 0 (Timeout) <0.020405>
0.020786 gettimeofday({1321268190, 745965}, NULL) = 0 <0.000080>
0.000630 kill(2865, SIG_0) = 0 <0.000083>
0.000347 gettimeofday({1321268190, 746939}, NULL) = 0 <0.000076>
0.000376 select(0, NULL, NULL, NULL, {0, 20000}) = 0 (Timeout) <0.020183>
0.020573 gettimeofday({1321268190, 767890}, NULL) = 0 <0.000078>
0.000399 kill(2865, SIG_0) = 0 <0.000302>
0.000577 gettimeofday({1321268190, 768864}, NULL) = 0 <0.000078>
0.000390 kill(2865, SIGKILL) = 0 <0.001457>
0.001810 gettimeofday({1321268190, 771069}, NULL) = 0 <0.000300>
0.000606 gettimeofday({1321268190, 771667}, NULL) = 0 <0.000079>
0.000388 kill(2865, SIG_0) = -1 ESRCH (No such process) <0.000189>
0.000906 exit_group(0) = ?
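One way to reproduce this pattern, where the probe keeps succeeding until SIGKILL is sent, is a target that has SIGTERM blocked. The sketch below is only an illustration under that assumption: the child script and the names CHILD, probe and term_then_kill are hypothetical stand-ins, not sshd or start-stop-daemon code.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a daemon that runs with SIGTERM blocked.
CHILD = (
    "import signal, time\n"
    "signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def probe(pid):
    """Return True while kill(pid, 0) succeeds."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def term_then_kill(retries=3, interval=0.02):
    """Send SIGTERM, probe a few times, then fall back to SIGKILL."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has blocked SIGTERM
    results = []
    os.kill(proc.pid, signal.SIGTERM)  # stays pending in the child
    for _ in range(retries):
        time.sleep(interval)
        # poll() would reap a dead child, so a zombie cannot pass as alive.
        results.append(proc.poll() is None and probe(proc.pid))
    os.kill(proc.pid, signal.SIGKILL)  # cannot be blocked
    proc.wait()  # reap the child so the pid really goes away
    proc.stdout.close()
    results.append(probe(proc.pid))
    return results
```

Under these assumptions term_then_kill() reports the child as alive on every probe after SIGTERM and as gone only after SIGKILL, which is the same shape as the trace.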
It looks like we have a wider problem with signals being blocked. I
tried a simple kill -SIGTERM sshd_pid and it was completely ignored. We
have also seen the same problem with our own application, where we had
to explicitly unblock SIGTERM and SIGINT, which I didn't think you
needed to do. Looking at the status for the sshd pid in /proc, SIGTERM,
SIGINT and SIGHUP are blocked. Here's the output:
SigPnd: 0000000000000000
ShdPnd: 0000000000004000
SigBlk: 0000000000004003
SigIgn: 0000000000001000
SigCgt: 0000000180014005
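For reference, these masks can be decoded mechanically: bit N-1 of each hexadecimal value stands for signal number N. A minimal sketch, assuming it runs on a machine with the same signal numbering as the target; the function name decode_sigmask is hypothetical.

```python
import signal


def decode_sigmask(hex_mask):
    """Return the names of the signals whose bits are set in a /proc mask."""
    mask = int(hex_mask, 16)
    names = []
    for signum in range(1, 65):
        if mask & (1 << (signum - 1)):
            try:
                names.append(signal.Signals(signum).name)
            except ValueError:
                # No name known to Python for this number.
                names.append(f"signal {signum}")
    return names


print(decode_sigmask("0000000000004003"))  # the SigBlk value
print(decode_sigmask("0000000000004000"))  # the ShdPnd value
```

Decoded this way, SigBlk gives SIGHUP, SIGINT and SIGTERM, and ShdPnd gives SIGTERM, so the SIGTERM that was sent is still pending behind the block.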
So I'm happy for you to close this as invalid.
BTW I couldn't build the dpkg source from git, as in Lenny the version of
gettext is too low.
Many Thanks,
Martin.
--
Martin Townsend
Power*Oasis*