Date: Fri, 12 Jul 2024 11:48:15 -0400
From: Chet Ramey <chet.ra...@case.edu>
Message-ID: <258bcd3a-a936-4751-8e24-916fbeb9c...@case.edu>

  | Not really, since the original intent was to wait for the *next* process
  | to terminate.

There are two issues with that. The first is "next after what", one
interpretation would be "the next after the last which was waited upon"
(one way or another). The other, and the one you seem to imply, is "next
which terminates after now" - ie: still running when the wait command is
executed.

But that's an obvious race condition, that's the second issue, as there is
no possible way to know (in the script which is executing "wait -n") which
processes have terminated at that instant.

Eg: let's assume I have two running bg jobs, one which is going to take a
very long time, the other which will finish fairly soon. For this e-mail,
I'll emulate those two with just "sleep", though one of them might be a
rebuild of firefox, and all its dependencies, from sources (yes, including
rust), which will take some time, and the other is a rebuild of "true"
(/bin/true not the builtin), which probably won't, as an empty executable
file is all that's required.

So, and assuming an implementation of sleep which accepts fractional
seconds:

    sleep $(( 5 * 24 * 60 * 60 )) & J1=$!
    sleep 0.01 & J2=$!
    printf 'Just so the shell is doing something: jobs are %s & %s\n' \
        "${J1}" "${J2}"
    wait -n

Now which of the two background jobs is that waiting for? Which do you
expect the script writer intended to wait for?

You can make the 2nd sleep be "sleep 0" if you want to do a more reasonable
test, just make sure when you test, to get a valid result, you don't
interrupt that wait.

The current implementation is lunacy, cannot possibly have any users, since
without doing a wait the script cannot possibly know what has finished
already, so can't possibly be explicitly excluding jobs which just happen
to have finished after the last "wait -n" (or other wait).

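The claim that a quick job can be finished before the script gets around
to waiting, while the shell still holds its exit status, can be checked
with a small sketch in plain POSIX sh. The exit status 7, the 5 second
stand-in for the long job, and the one second pause are arbitrary choices
made for illustration, they are not from the message above.

```shell
#!/bin/sh
# Long job: stands in for the slow build (5 seconds is an arbitrary choice).
sleep 5 & J1=$!

# Quick job: exits at once with a recognisable status (7 is arbitrary).
( exit 7 ) & J2=$!

# Let the quick job finish before the script is ready to wait.
sleep 1

# The shell has retained J2's status, even though J2 ended a while ago.
status=0
wait "$J2" || status=$?
printf 'job %s had already finished, status %s\n' "$J2" "$status"

# Clean up the long job so the sketch does not linger.
kill "$J1" 2>/dev/null || true
wait "$J1" 2>/dev/null || true
```

So the information needed to treat an already finished job as a candidate
is there in the shell; the script just has no way to ask for "any one of
them" without naming a pid.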
Of course, in the above simple example, the wait -n could be replaced by
wait "${J2}" which would work just fine, but a real example would probably
have many running jobs, some of which are very quick, and others which
aren't, and some arbitrary ones of the quick ones might be so quick that
they are finished before the script is ready to wait. Even a firefox build
might be that quick, if the options passed to the top level make happen to
contain a make syntax error, and so all that happens is an error message
(Usage:...) and very quick exit.

Please just change this, use the first definition of "next job to finish" -
and in the case when there are already several of them, pick one, any one -
you could order them by the time that bash reaped the jobs internally, but
there's no real reason to do so, as that isn't necessarily the order the
actual processes terminated, just the order the kernel picked to answer the
wait() sys call, when there are several child zombies ready to be reaped.

  | > Bash is already tracking the pids for all child processes not waited
  | > on, internally. So I imagine it wouldn't be too much work to make that
  | > information available to the script it's running.
  |
  | So an additional feature request.

If it helps, to perhaps provide some consistency, the NetBSD shell has a
builtin:

    jobid [-g|-j|-p] [job]

With no flags, print the process identifiers of the processes in the job.
(-g instead gives the process group, -j the job identifier (%n), and -p the
lead pid (that which was $! when the job was started, which might also be
the process group, but also might not be).) The "job" arg (which defaults
to '%%') can identify the job by any of the methods that wait, or kill, or
"fg" (etc) allow, that is %% %- %+ %string or a pid ($!). Just one "job"
arg, and only one option allowed, so there's no temptation (nor
requirement) to attempt to write sh code to parse the output and work out
what is what.
It's a builtin, running it multiple times is cheaper than any parse attempt
could possibly be.

jobid exits with status 2 if there is an argument error, status 1 if, with
-g, the job had no separate process group, or, with -p, there is no process
group leader (should not happen), and otherwise exits with status 0.
("argument error" includes both things like giving 2 options, or an invalid
(unknown) one, or giving a job arg that doesn't resolve to a current
(running, stopped, or terminated but unwaited) job.)

Job control needs to be enabled (rare in scripts) to get separate process
groups. The "process group leader" is just $! - has no particular
relationship with actual process groups (and yes, the wording could be
better).

That command can be run after each job is created, using $! as the job arg,
and saving the pids, and/or job number (for later execution when needed)
however the script likes.

Much the same info is also available using jobs -l, but that is hard to
parse, and has the side effect of also waiting on any jobs which happen to
have already terminated, requiring even more parsing to extract the status,
so isn't a practical solution.

In a shell with job control enabled:

    sleep 3 | sleep 4 | sleep 5 & echo '$! is' $!; \
        jobid $!; jobid -g $!; jobid -p $!; jobid -j $!
    $! is 6862
    816 8140 6862
    816
    6862
    %1

And yes, it works inside a sub-shell, including command substitutions,
until some other job (foreground or background, but not builtin) is started
in that subshell environment, so something like

    eval $( printf 'pids="'; jobid $!; printf '"; job='; jobid -j $1 )

works:

    sleep 3 | sleep 4 | sleep 5 &
    eval $( printf 'pids="'; jobid $!; printf '"; job='; jobid -j $1 )
    echo "PID=$! JOB=${job} PIDS='${pids}'"
    PID=8541 JOB=%2 PIDS='18420 27242 8541'

kre