Hi all:

I've got a problem with a user's MPI job.  This code is in use on
dozens of clusters around the world, but for some reason, when run on
my Rocks 4.3 cluster, it dies at random timesteps.  The logs are quite
unhelpful:

[root@aeolus logs]# more 2047.aeolus.OU
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is  /mnt/pvfs2/patton/data/chem/aa1
exec directory is  /mnt/pvfs2/patton/exec/chem/aa1
arch directory is  /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...

Terminated
--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed).  Hit control-C again within 1
second if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
[compute-0-0.local:03444] OOB: Connection to HNP lost

We've been trying to figure out what's going on.  We've tried
different datasets, different nodes, and different numbers of
processors.  We started on OpenMPI 1.2.4 and upgraded to 1.2.6, with
no change.  We've connected the compute node to the head node
directly (bypassing the switch, etc.) with no change.  It doesn't
matter where the data is stored.  If we run with nodes=1
(single-threaded, single CPU), it runs through to completion.

The only clue we've found turned up this morning:  if we run the job
directly with mpirun (so torque knows nothing about it), it runs
fine.  But submit it through torque+maui, and it dies as above.
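
A bare-bones batch script along the lines of the one below (node
counts and the binary name are just placeholders) should at least
show whether the environment and limits the job sees under torque
differ from a direct mpirun session:

#!/bin/bash
#PBS -N envcheck
#PBS -l nodes=2:ppn=2       # placeholder node/cpu counts
#PBS -l walltime=00:10:00
#PBS -j oe

# Record what the batch job actually sees, to diff against an
# interactive run of the same mpirun command.
echo "=== environment ==="; env | sort
echo "=== limits ==="; ulimit -a
echo "=== nodefile ==="; cat "$PBS_NODEFILE"

cd "$PBS_O_WORKDIR"
mpirun -np 4 ./a.out        # ./a.out stands in for the real binary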

I'm at a loss at this point as to how to troubleshoot this further.
Is there a way to get more details out of torque about this?  Turn up
logging?  Any known issues that might affect this?  I have about a
dozen users running on the cluster, all using the scheduler, about
half of them running MPI jobs (some using nearly the entire cluster
in a run), all without any such problems.  Any suggestions?
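
For reference, the torque logging knobs I'm aware of are roughly the
ones below (the config file location is just where it sits on my
install), though I don't know whether they'll show anything useful
here:

# more verbose pbs_server logging (0-7):
qmgr -c "set server log_level = 7"

# more verbose pbs_mom logging: on each compute node, add to
# mom_priv/config (under the torque spool directory) and restart
# pbs_mom:
#   $loglevel 7

# pull together everything torque logged about the dead job:
tracejob 2047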

--Jim
