On 03/01/10 11:51, Ralph Castain wrote:
> On Mar 1, 2010, at 8:41 AM, David Turner wrote:
>> On 3/1/10 1:51 AM, Ralph Castain wrote:
>>> Which version of OMPI are you using? We know that the 1.2 series was unreliable
>>> about removing the session directories, but 1.3 and above appear to be quite
>>> good about it. If you are having problems with the 1.3 or 1.4 series, I would
>>> definitely like to know about it.
>> Oops; sorry! OMPI 1.4.1, compiled with PGI 10.0 compilers,
>> running on Scientific Linux 5.4, ofed 1.4.2.
>> The session directories are *frequently* left behind. I have
>> not really tried to characterize under what circumstances they
>> are removed. But please confirm: they *should* be removed by
>> OMPI.
> Most definitely - they should always be removed by OMPI. This is the first
> report we have had of them -not- being removed in the 1.4 series, so it is
> disturbing.
> What environment are you running under? Does this happen under normal
> termination, or under abnormal failures (the more you can tell us, the better)?
Hi Ralph:
It turns out that I am seeing session directories left behind as well
with v1.4 (r22713). I have not tested any other versions. I believe
there are two elements that make this reproducible.
1. Run across 2 or more nodes.
2. CTRL-C out of the MPI job.
Then take a look at the remote nodes and you may see a leftover session
directory. The mpirun node seems to be clean.
Here is an example using two nodes. I also added some sleeps to the
ring_c program to slow things down so I could hit CTRL-C.
First, tmp directories are empty:
[rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
Now run test:
[rolfv@burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c
Process 0 sending 10 to 1, tag 201 (4 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6
exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished
[burl-ct-x2200-6:02990] 2 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
Now check tmp directories:
[rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
total 8
drwx------ 3 rolfv hpcgroup 4096 Mar 1 17:27 20007/
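
The same check can be scripted so it is easy to repeat after each run. This is only a minimal sketch, assuming the /tmp/openmpi-sessions-<user>* pattern used in the ls commands above. The function name leftover_session_dirs and its parameters are hypothetical, not part of OMPI, and the script only inspects the node it runs on, so it would have to be run on each node separately.

```python
import getpass
import glob
import os


def leftover_session_dirs(tmp_root="/tmp", user=None):
    """Return sorted directories matching <tmp_root>/openmpi-sessions-<user>*.

    The pattern is an assumption taken from the ls commands in this thread.
    """
    user = user or getpass.getuser()
    # Escape the fixed prefix so only the trailing "*" acts as a wildcard.
    prefix = os.path.join(tmp_root, "openmpi-sessions-" + user)
    pattern = glob.escape(prefix) + "*"
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    dirs = leftover_session_dirs()
    if dirs:
        for d in dirs:
            print("leftover:", d)
    else:
        print("no leftover session directories")
```

Running it before and after the CTRL-C test on each node should show whether anything was left behind on that node.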
Rolf
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================