On Fri, Feb 07, 2003 at 08:18:20AM +0100, Ronald Bultje wrote:
> Hi Brian,

Hi Ronald.

> it writes q\n, which quits lavplay nicely.

Right.

> After that, lavplay should be
> long gone when do_real_exit() is called.

Should be.  But it might lag while it waits for disk i/o or something.

> If not, then something is
> surely wrong and we'd better kill lavplay as evilly
                                               ^^^^^^
I disagree with this "evilly".  lavplay might just be waiting for disk
access.  Killing a process "evilly" just because it is waiting its
turn at disk (or some other resource) is not good process management.
But this is not the real issue.

> as possible to
> prevent zombie processes, so we kill -9 it.

This is the wrong approach.  If you have to do this, then there is a
bug in lavplay.  Why not address the real problem instead of hacking
around it?

> Normally, the kill -9 won't do anything because there won't be any open
> child processes.

Wrong.  You did not read all of my message carefully.  I will explain
again.  When you issue the "kill(0,9)" you are not just killing
lavplay but every process in the same process group!

This is not just the child processes of glav, but also glav's parent
and any children of that parent that have not issued their own
setpgrp()!  To show you an example of what that means (I do not mean
to sound condescending, but you didn't seem to understand the
ramifications when I explained them in my last message):

# ps -eo pid,ppid,pgrp,tty,args | less
  PID  PPID  PGRP TT       COMMAND
    1     0     0 ?        init
    2     1     1 ?        [keventd]
    3     1     1 ?        [ksoftirqd_CPU0]
    4     1     1 ?        [kswapd]
 2260     1  2260 ?        gnome-terminal ...
...
 2311  2260  2311 pts/1    bash
...
17403  2311 17403 pts/1    /bin/bash /home/brian/bin/edlit test.mov
17410 17403 17403 pts/1    glav --size 640x480 test.mov
17411 17410 17403 pts/1    lavplay -q -g --size 640x480 test.mov
17412 17411 17403 pts/1    lavplay -q -g --size 640x480 test.mov
17413 17412 17403 pts/1    lavplay -q -g --size 640x480 test.mov
17414 17412 17403 pts/1    lavplay -q -g --size 640x480 test.mov
17415 17412 17403 pts/1    lavplay -q -g --size 640x480 test.mov
...

As you can see, glav was started by a shell script (PID 17403), which
was started by an interactive shell (PID 2311), which was started by a
"gnome-terminal" (PID 2260).

When the interactive shell starts (PID 2311) it starts a new "process
group", in this case, process group 2311.  All processes that inherit
from that shell, unless they issue a setpgrp(), will have the same
process group id.

But notice that the shell script that started glav created a new
process group (PGRP 17403), whereas glav _did_not_.  So when glav
issues a kill(0,9), every process with PGRP == 17403 is sent a
SIGKILL, and that includes the process that started glav (and any
other processes that that shell script may have started -- none in
this example)!
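
To make the inheritance concrete, here is a minimal C sketch.  The
helper name child_shares_pgrp() is made up for illustration and is not
anything in glav.  It forks a child and reports whether the child is
still in the caller's process group.  A child that calls setpgid(0, 0)
first lands in a group of its own, so a kill(0, ...) issued from inside
that group would no longer reach the parent.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child and report whether it is in the
 * caller's process group.  If new_group is nonzero, the child first
 * moves itself into a fresh group with setpgid(0, 0).
 * Returns 1 if the groups match, 0 if they differ, -1 on error. */
int child_shares_pgrp(int new_group)
{
    pid_t parent_pgrp = getpgrp();
    pid_t pid = fork();
    int status;
    int code;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: optionally leave the inherited group, then report. */
        if (new_group && setpgid(0, 0) != 0)
            _exit(2);
        _exit(getpgrp() == parent_pgrp ? 1 : 0);
    }

    /* Parent: reap exactly this child and decode its answer. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

Called with 0 this mirrors what glav does today (the child stays in the
inherited group); called with 1 it mirrors a child that has issued its
own setpgrp().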

glav has no business killing any processes except those that it
spawned.  IMHO it should not kill anything at all, but if you insist,
it should kill only the child it knows it is responsible for, which is
"pid" (the same "pid" that glav does a waitpid() on directly after the
kill).

If you really want to keep the kill(), please either:

a. change the first argument of kill() from 0 to pid, or
b. if you still want that "wide sweeping" kill, at least have glav
   create a new process group with setpgrp() first, or
c. drop the kill() and just use wait().

Whatever you do, do _not_ change the first argument of kill() to -1:
that signals every process glav has permission to signal, which is
even worse than 0.
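
Option (a) can be sketched roughly as below.  This is not glav's code:
stop_child(), spawn_idle_child() and the grace period are names and
choices made up for illustration, and trying SIGTERM before SIGKILL is
my own assumption about what a polite shutdown would look like.  The
point is only that the one known child pid is ever signalled, never
process group 0.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask one child to exit, escalate to SIGKILL only
 * if it has not gone away after grace_ms milliseconds, and reap it.
 * Only "pid" is signalled.  Stores the wait status in *status.
 * Returns 0 on success, -1 on error. */
int stop_child(pid_t pid, int grace_ms, int *status)
{
    struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int waited;

    if (kill(pid, SIGTERM) != 0)
        return -1;

    /* Poll so a child that is merely slow (disk i/o) gets its chance. */
    for (waited = 0; waited < grace_ms; waited += 10) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 0;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Last resort, and still aimed at the single child. */
    kill(pid, SIGKILL);
    return waitpid(pid, status, 0) == pid ? 0 : -1;
}

/* Hypothetical stand-in for lavplay: a child that idles until signalled. */
pid_t spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    return pid;
}
```

Because the child is reaped with waitpid() in the same function, no
zombie is left behind, and the parent shell script never sees a signal.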

> I'd call this "not a bug, but a feature".

No, it's not a feature.  Having glav kill its parent and its
parent's children is a bug.

b.

-- 
Brian J. Murrell
