Hi,

We invoke mpirun from within a script that installs its own signal handlers.
If we abort an Open MPI run with CTRL+C, the system sends SIGINT to the entire
foreground process group, so the mpirun process receives a SIGINT from the
system with si_code=SI_KERNEL. In addition, our own signal handler intercepts
SIGINT, does some cleanup, and forwards the SIGINT to the mpirun process with
si_code=SI_USER. Consequently, mpirun receives SIGINT twice. With Open MPI
4.0.3, this leads to unclean termination: although no zombie processes are
left behind, killing mpirun in this way leaves vader shared memory segment
files in /dev/shm (a known issue with Open MPI 3, but supposedly resolved in
Open MPI 4). strace also shows that the mpirun process does not receive any
SIGCHLD.
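
To make the SI_KERNEL / SI_USER distinction concrete, here is a minimal
diagnostic sketch (not part of our actual script) that waits for a signal and
reports its si_code. The function names are made up for illustration, and the
numeric values of SI_USER and SI_KERNEL are the Linux ones, written out by hand
because Python's signal module does not export them.

```python
import signal

# Assumption: Linux values from <asm-generic/siginfo.h>.
# Python's signal module does not provide these constants.
SI_USER = 0       # sent by kill(2) from a user process
SI_KERNEL = 0x80  # sent by the kernel, e.g. the tty driver on CTRL+C


def describe_origin(info):
    """Turn a struct_siginfo into a short human-readable description."""
    if info.si_code == SI_USER:
        return f"SI_USER from pid {info.si_pid}"
    if info.si_code == SI_KERNEL:
        return "SI_KERNEL"
    return f"si_code={info.si_code}"


def wait_and_describe(signum=signal.SIGINT, timeout=5.0):
    """Block signum, wait for it synchronously, and describe where it came from.

    Returns None if no such signal arrives within timeout seconds.
    """
    # The signal must be blocked so that it stays pending for sigtimedwait().
    signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    info = signal.sigtimedwait({signum}, timeout)
    return None if info is None else describe_origin(info)
```

Calling wait_and_describe() in a foreground process and pressing CTRL+C should
report SI_KERNEL, whereas a kill -INT from another shell should report SI_USER
together with the sender's pid.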

If we remove our own signal handler (which is not our preferred option), mpirun 
receives only a single SIGINT and n SIGCHLD signals (where n is the number of 
processes). In this case, the vader shared memory segment files are cleaned up 
correctly.

Is it expected that the cleanup fails when mpirun receives multiple signals at 
the same time? If so, is the only way to guarantee proper cleanup to make sure 
that only a single signal is ever propagated to mpirun?
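
For reference, one way a wrapper could arrange for only a single SIGINT to
reach mpirun is sketched below. This is an untested idea rather than what our
script currently does, and we do not know whether it avoids the cleanup
problem. It assumes a Python wrapper; ForwardOnce, launch_isolated and run are
hypothetical names. The child is started in its own session so that CTRL+C from
the terminal no longer reaches it directly, and the wrapper's handler forwards
the signal at most once.

```python
import os
import signal
import subprocess


class ForwardOnce:
    """Signal handler that forwards one signal to a child, at most once."""

    def __init__(self, child, signum=signal.SIGINT):
        self.child = child
        self.signum = signum
        self.sent = 0

    def __call__(self, received_signum, frame):
        # Application-specific cleanup would go here (hypothetical).
        if self.sent == 0 and self.child.poll() is None:
            self.sent += 1
            os.kill(self.child.pid, self.signum)


def launch_isolated(cmd):
    """Start cmd in a new session, outside the terminal's process group."""
    # start_new_session=True makes the child call setsid() before exec, so a
    # CTRL+C in the terminal is delivered to the wrapper but not to the child.
    return subprocess.Popen(cmd, start_new_session=True)


def run(cmd):
    """Launch cmd, forward the first SIGINT to it, and return its exit status."""
    child = launch_isolated(cmd)
    signal.signal(signal.SIGINT, ForwardOnce(child))
    return child.wait()
```

A call such as run(["mpirun", "-np", "4", "./app"]) (a placeholder command
line) would then be the entry point. One caveat: mpirun no longer belongs to
the terminal's session in this setup, which may matter if the job reads
interactive input.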



Thanks,
Moritz

--
Moritz Kreutzer

Siemens Digital Industries Software
Simulation and Test Solutions, Product Development, High Performance Computing
Nordostpark 3
90411 Nuremberg, Germany
Tel.: +49 (911) 38379 8085
moritz.kreut...@siemens.com
www.sw.siemens.com



-----------------
Siemens Industry Software GmbH; Anschrift: Franz-Geuer-Str. 10, 50823 Köln; 
Gesellschaft mit beschränkter Haftung; Geschäftsführer: Dr. Erich Bürgel, 
Alexander Walter; Sitz der Gesellschaft: Köln; Registergericht: Amtsgericht 
Köln, HRB 84564; Vorsitzender des Aufsichtsrats: Jürgen Köhler
