Re: [OMPI users] OpenMPI exits when subsequent tail -f in script is interrupted

Ralph Castain Sat, 23 Apr 2011 11:15:24 -0400

On Apr 23, 2011, at 9:07 AM, Pablo Lopez Rios wrote:

>> what about:
>> ( trap "" sigint; exec mpiexec ...)&
> 
> Yup, that's included in the workarounds I tried. Tried again with your 
> specific suggestion; no luck.
> 
>> Well, maybe mpiexec is adjusting it on its own
>> again. This can be checked in /proc/<pid>/status
> 
> The signal masks in /proc/$!/status are:
> 
> nompi (bash):
> SigBlk: 0000000000010000 ->  16 blocked
> SigIgn: 0000000000000006 ->  1,2 ignored
> SigCgt: 0000000000010000 ->  16 caught
> 
> mpi (mpirun):
> SigBlk: 0000000000000000 ->  none blocked
> SigIgn: 0000000000000004 ->  2 ignored
> SigCgt: 0000000180015ee2 ->  1,5,6,7,9,10,11,12,14,16,31,32 caught
> 
> I think I'm off by one in interpreting the above masks (for instance I would 
> expect signals 30 and 31 to be caught, not 31 and 32), but I'm already 
> assuming that the least significant bit is "signal 0"; assuming it is "signal 
> 1" would just worsen the values.
> 
> Anyway, why does mpirun bypass the traps I try to set and how do I stop it 
> doing so?


You can't - this is a design requirement for clean termination of MPI jobs when 
the user interrupts execution.
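
On the off-by-one question in the masks quoted above: as far as I know, on Linux the least significant bit of the SigBlk/SigIgn/SigCgt fields stands for signal 1, not "signal 0", so every number in that list should be shifted up by one. Read that way, the mpirun SigIgn value 0000000000000004 is signal 3 (SIGQUIT on x86 Linux), and signal 2 (SIGINT) shows up in SigCgt instead, which would be consistent with mpirun installing its own SIGINT handler. Below is a rough sketch of the decoding; decode_sigmask and signal_name are names made up for this illustration, and the signal names printed depend on the platform's numbering.

```python
import signal


def decode_sigmask(hex_mask):
    """Return the signal numbers set in a /proc/<pid>/status mask string."""
    value = int(hex_mask, 16)
    # Bit 0 (the least significant bit) stands for signal 1, hence the +1.
    return [bit + 1 for bit in range(value.bit_length()) if (value >> bit) & 1]


def signal_name(number):
    """Best-effort name lookup; real-time signals usually have no name."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "signal %d" % number


if __name__ == "__main__":
    # Mask values copied from the mpirun entry quoted above.
    masks = {
        "SigBlk": "0000000000000000",
        "SigIgn": "0000000000000004",
        "SigCgt": "0000000180015ee2",
    }
    for field, mask in masks.items():
        numbers = decode_sigmask(mask)
        names = ", ".join(signal_name(n) for n in numbers) or "none"
        print("%s: %s -> %s" % (field, numbers, names))
```

For the mpirun SigCgt value this gives signals 2, 6, 7, 8, 10, 11, 12, 13, 15, 17, 32 and 33, i.e. the list quoted above with each entry increased by one.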


> 
> Thanks,
> Pablo
> 
> On 23/04/11 13:20, Reuti wrote:
>> Hi,
>> 
>> Am 23.04.2011 um 04:31 schrieb Pablo Lopez Rios:
>> 
>>> I'm having a bit of a problem with wrapping mpirun in a script. The script 
>>> needs to run an MPI job in the background and tail -f the output. Pressing 
>>> Ctrl+C should stop tail -f, and the MPI job should continue. However mpirun 
>>> seems to detect the SIGINT that was meant for tail, and kills the job 
>>> immediately. I've tried workarounds involving nohup, disown, trap, 
>>> subshells (including calling the script from within itself), etc, to no 
>>> avail.
>>> 
>>> The problem is that this doesn't happen if I run the command directly 
>>> instead, without mpirun. Attached is a script that reproduces the problem. 
>>> It runs a simple counting script in the background which takes 10 seconds 
>>> to run, and tails the output. If called with "nompi" as first argument, it 
>>> will simply run bash -c "$SCRIPT" >& "$out" &, and with "mpi" it will do the 
>>> same with 'mpirun -np 1' prepended. The output I get is:
>> what about:
>> 
>> ( trap "" sigint; exec mpiexec ...)&
>> 
>> i.e. replace the subshell with changed interrupt handling with the mpiexec. 
>> Well, maybe mpiexec is adjusting it on its own again. This can be checked in 
>> /proc/<pid>/status
>> 
>> -- Reuti
>> 
>>> $ ./ompi_bug.sh mpi
>>> mpi:
>>> 1
>>> 2
>>> 3
>>> 4
>>> ^C
>>> $ ./ompi_bug.sh nompi
>>> nompi:
>>> 1
>>> 2
>>> 3
>>> 4
>>> ^C
>>> $ cat output.*
>>> mpi:
>>> 1
>>> 2
>>> 3
>>> 4
>>> mpirun: killing job...
>>> 
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 1222 on node pablomme exited on 
>>> signal 0 (Unknown signal 0).
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>> 
>>> nompi:
>>> 1
>>> 2
>>> 3
>>> 4
>>> 5
>>> 6
>>> 7
>>> 8
>>> 9
>>> 10
>>> Done
>>> 
>>> 
>>> This convinces me that there is something strange with OpenMPI, since I 
>>> expect no difference in signal handling when running a simple command with 
>>> or without mpirun in the middle.
>>> 
>>> I've tried looking for options to change this behaviour, but I don't seem 
>>> to find any. Is there one, preferably in the form of an environment 
>>> variable? Or is this a bug?
>>> 
>>> I'm using OpenMPI v1.4.3 as distributed with Ubuntu 11.04, and also v1.2.8 
>>> as distributed with OpenSUSE 11.3.
>>> 
>>> Thanks,
>>> Pablo
>>> <ompi_bug.sh.gz>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
