On 03/01/10 11:51, Ralph Castain wrote:
On Mar 1, 2010, at 8:41 AM, David Turner wrote:

On 3/1/10 1:51 AM, Ralph Castain wrote:
Which version of OMPI are you using? We know that the 1.2 series was unreliable 
about removing the session directories, but 1.3 and above appear to be quite 
good about it. If you are having problems with the 1.3 or 1.4 series, I would 
definitely like to know about it.
Oops; sorry!  OMPI 1.4.1, compiled with PGI 10.0 compilers,
running on Scientific Linux 5.4, OFED 1.4.2.

The session directories are *frequently* left behind.  I have
not really tried to characterize under what circumstances they
are removed. But please confirm:  they *should* be removed by
OMPI.

Most definitely - they should always be removed by OMPI. This is the first 
report we have had of them -not- being removed in the 1.4 series, so it is 
disturbing.

What environment are you running under? Does this happen under normal 
termination, or under abnormal failures (the more you can tell us, the better)?



Hi Ralph:

It turns out that I am also seeing session directories left behind with v1.4 (r22713). I have not tested any other versions. I believe two conditions make this reproducible:
1. Run across 2 or more nodes.
2. CTRL-C out of the MPI job.

Then take a look at the remote nodes and you may see a leftover session directory. The mpirun node seems to be clean.

Here is an example using two nodes. I also added some sleeps to the ring_c program to slow things down so I could hit CTRL-C.

First, tmp directories are empty:
[rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.

Now run test:
[rolfv@burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c
Process 0 sending 10 to 1, tag 201 (4 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6 exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

[burl-ct-x2200-6:02990] 2 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix

Now check tmp directories:
[rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
total 8
drwx------ 3 rolfv hpcgroup 4096 Mar  1 17:27 20007/
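
For anyone who wants to repeat the before/after check above without typing the ls by hand on each node, here is a minimal sketch in POSIX sh. The function name list_session_dirs and its two arguments are hypothetical, made up for illustration; the only thing taken from the transcript is the /tmp/openmpi-sessions-<user>* glob. It prints any matching directories and prints nothing when the node is clean.

```shell
#!/bin/sh
# Sketch only: list leftover Open MPI session directories for one user.
# list_session_dirs is a hypothetical helper name, not part of Open MPI.
# Usage: list_session_dirs [tmpdir] [user]
list_session_dirs() {
    tmpdir=${1:-/tmp}          # base directory to inspect (default /tmp)
    user=${2:-$(id -un)}       # user whose session dirs we look for
    # Same glob as in the transcript: openmpi-sessions-<user>*
    for d in "$tmpdir"/openmpi-sessions-"$user"*; do
        # An unmatched glob stays literal in sh, so test that it is a directory.
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
        fi
    done
    return 0
}
```

Assuming passwordless ssh between the nodes (an assumption, not something stated above), the same check can be done remotely with a plain `ssh burl-ct-x2200-7 'ls -d /tmp/openmpi-sessions-rolfv*'` before and after the CTRL-C.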

Rolf

--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
