On May 27, 2008, at 11:47 AM, Jim Kusznir wrote:

I have updated to OpenMPI 1.2.6 and had the user rerun his jobs.  He's
getting similar output:

[root@aeolus logs]# more 2047.aeolus.OU
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is  /mnt/pvfs2/patton/data/chem/aa1
exec directory is  /mnt/pvfs2/patton/exec/chem/aa1
arch directory is  /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...

FWIW: this message ("mpirun: killing job...") *only* displays if mpirun catches a SIGINT or SIGTERM.

This seems quite fishy; I seem to recall that Torque sends a SIGTERM about 30 seconds before the job's wallclock limit runs out. Can you run a simple test? Replace the "mpirun..." line with some other command -- perhaps a short C program that prints a line every N seconds -- just so that you can see continued progress. See if it dies (or catches a SIGINT or SIGTERM) after about the same amount of time that mpirun typically runs before dying.

--
Jeff Squyres
Cisco Systems
