Re: 'wait -n' with and without id arguments

Robert Elz Thu, 17 Oct 2024 20:28:40 -0700

    Date:        Thu, 17 Oct 2024 17:14:52 -0400
    From:        Chet Ramey <chet.ra...@case.edu>
    Message-ID:  <9d279d7f-ea94-4f75-9a52-059f6d2b9...@case.edu>



  | > Maybe those defenders can elucidate what purpose that behavior
  | > would serve.
  |
  | kre's on the list, maybe he'll speak up.

Sorry, I largely gave up on this discussion a while ago now, and haven't
been following what all the current discussion is about.

I get the impression the point here is perhaps nothing specifically to
do with wait -n (despite the Subject header of this thread) and is
really about removing remembered old process info from the shell ??

My general opinion on that is that it really is (or should be) quite
simple:

If the user (or script) does any kind of "wait" command, then whatever
is processed should be removed (from everywhere) forthwith.   Note here
that "wait" is supposed to only wait for completed processes - if stopped
jobs are returned (either via an option to do that, or by default, which
would not be posix compliant) then those obviously don't get removed.

If the user (or script) executes the jobs cmd (any normal use which scans
the jobs table, but not jobs -p, which just extracts pids/pgrps (I don't
think the posix people understand that they're not the same thing), or
anything that has been added which is similar), then all jobs reported to
the user as "Done" should be removed.

When an interactive shell notifies the user before printing a prompt that
a job is now Done - that job should be removed (from everywhere).

What happens with set -b notifications is unspecified - and largely
irrelevant.  For interactive use, it makes no real practical difference.
A job which completes is notified asap when "notify" (-b) is set, or at
the next prompt otherwise.   The job can be cleaned up either with the
notification, or when the notification would have been done at the next
prompt - the only difference it can make is if in the user's currently
running list of commands there happens to be a "wait" or "jobs" command.
jobs would be irrelevant - it notifies and cleans up, -b has notified
already, no significant difference.   wait would be affected, so it might
be better to not clean up until the next prompt - but either is allowed.

For "set -b" in scripts, well, only an idiot would do that in something
that matters (as distinct from something designed to test how set -b works)
as having random "job finished" notifications about async jobs that the
user most probably never knew existed seems like a stunningly poor design
choice, and I can't imagine what other purpose it can have.  So for that
case I simply don't care what happens.

However one thing is clear: once wait returns status to a user/script about
a completed job/process, the process is effectively gone forever.   The same
applies when wait otherwise deals with it, as with just "wait", or with
"wait pid1 pid2 pid3 ... pidN" for all the pid args except pidN - their
status is never returned, but the script writer knew that would happen when
such a wait command was written.   "Effectively" as I don't mind retaining
it in the jobs table, if there are other processes in the job which have not
yet completed, until the whole job is done - but it must never be found
again by a "wait" command.
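
As a rough illustration of the multi-pid case (a sketch only; the subshells
and variable names are hypothetical stand-ins for real async jobs), only the
status of the last operand comes back, so a script that wants every status
has to wait for each pid separately and save the result:

```shell
# Hypothetical stand-ins for three async jobs with distinct exit statuses.
(exit 1) & p1=$!
(exit 2) & p2=$!
(exit 3) & p3=$!

# One wait, many pids: only the status of the last operand is returned.
last=0
wait "$p1" "$p2" "$p3" || last=$?
echo "last=$last"

# To keep every status, wait for each pid on its own and save it right away.
(exit 4) & q1=$!
(exit 5) & q2=$!
s1=0; wait "$q1" || s1=$?
s2=0; wait "$q2" || s2=$?
echo "s1=$s1 s2=$s2"
```

The statuses of p1 and p2 are simply discarded by the first wait, which is
the trade-off the script writer accepted by writing it that way.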

Part of the point of the "jobs" command is to clean up old completed
stuff (rarely needed interactively for that purpose, but "jobs>/dev/null"
in a script can be useful, just to keep the shell's data structs smaller
if the script is never going to want the status, but nor does it want the
potential delay that doing "wait" might entail, if some jobs are still
running).   So, anything jobs reports as Done should always be removed.
No questions asked.
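
A minimal sketch of that script usage, assuming a hypothetical workload of
short async jobs whose status is never wanted (whether a particular shell
actually frees the entries at this point is exactly the behaviour being
argued for here, not something the sketch can guarantee):

```shell
# Hypothetical workload: a few short-lived async jobs whose status is unused.
launched=0
for n in 1 2 3; do
    sleep 0 &
    launched=$((launched + 1))
    # Non-blocking: lets the shell drop whatever it would report as Done,
    # without waiting for anything that is still running.
    jobs >/dev/null
done
echo "launched=$launched"
```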

There's one issue that people seem to ignore in all of this ... shells
clean up zombie processes as soon as they're able (they don't leave them
hanging in the kernel until the user decides to wait/jobs to clean up).

That has two effects - first, there's no need to clean up async jobs in
order to avoid clogging the kernel's process table, or reaching the max
number of children the kernel permits a process to have - the shell has
already dealt with that issue, all that remains is shell data structure
memory, which is relatively cheap, and if an application (script) isn't
going to run for long enough to really matter, then it need never clean
up at all - when the script is done and the shell exits, all its memory
goes with it.   (Long running scripts, like long running
anything else, should clean up of course.)

And second, once the shell has reaped a zombie process, there's nothing
to stop the kernel from assigning the same pid to another process the
shell forks.   That is, we could have:

        cmd1 &  X=$!

        # time passes doing other work, cmd1 finishes, shell reaps zombie

        cmd2 &  Y=$!

at this point it is possible (if unlikely) that $X = $Y  (both processes
assigned the same pid by the kernel).   This means a later

        wait $X

might wait for cmd2 rather than cmd1 (and vice versa).   There's not a
lot that can really be done, though I have considered checking the pid
resulting from fork() in both parent and child (via getpid() in the child)
and simply aborting that child if that pid is currently known to the
shell - there's really no other solution I can see to this problem.   This
is easy and safe to do, as both parent and child share the same data
immediately after the fork() - each knows exactly what the other will do.
Just abandon that fork() and do another, and hope for a better pid next
time (one the shell doesn't currently know).

But to avoid this being likely, it is important that the shell rid itself
of old data related to now extinct pids at the earliest possible moment.
As soon as the shell has dealt with them to meet the needs of the script
or user (they've been the subject of a wait or jobs) then they should be gone.
Scripts must not be encouraged to believe they can ask again later (for the
above reason) - this means the chances of a fork() needing to be abandoned
are smaller than they would be had the shell been collecting pids of its
long dead children (and what killed them) as some kind of souvenir to show
off to anyone who asks to look...
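
From the script's side, the matching habit might look like the following
sketch (the subshell standing in for cmd1 and the variable status_X are
assumptions for illustration): ask the shell once, keep the answer in a
variable, and consult the variable later rather than running wait again.

```shell
# Hypothetical stand-in for cmd1, exiting with status 7.
(exit 7) & X=$!

# Ask exactly once, and keep the answer in the script's own variable.
status_X=0
wait "$X" || status_X=$?

# ... later: consult the saved value, not the shell, since by now $X may
# name nothing at all, or some unrelated newer child.
if [ "$status_X" -ne 0 ]; then
    echo "cmd1 failed with status $status_X"
fi
```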

kre

ps: full disclosure: the NetBSD sh doesn't yet implement -b (it
understands the option, but does nothing with it) -- I have been
considering fixing that recently though.   I'm yet to decide (and
I don't think it is specified) whether it is the state of -b when
an async job is started, or its state when that job completes, which
matters (it cannot be both I don't think - not rationally anyway.)
