On May 27, 2008, at 11:47 AM, Jim Kusznir wrote:
> I have updated to OpenMPI 1.2.6 and had the user rerun his jobs. He's
> getting similar output:
> [root@aeolus logs]# more 2047.aeolus.OU
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is /mnt/pvfs2/patton/data/chem/aa1
> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...
FWIW: this message ("mpirun: killing job...") *only* displays if
mpirun catches a SIGINT or SIGTERM.
This seems quite fishy; I seem to recall that Torque sends a SIGTERM at
T-30 seconds before the job's wallclock time runs out. Can you do a
stupid test? Replace the "mpirun..." with some other command --
perhaps a short C program that outputs a line every N seconds or
something, just so that you can see continued progress. See if it
dies (or catches a SIGINT or SIGTERM) in about the same amount of time
that mpirun typically dies.
--
Jeff Squyres
Cisco Systems