Hi all: I've got a problem with a user's MPI job. This code is in use on dozens of clusters around the world, but for some reason, when run on my Rocks 4.3 cluster, it dies at random timesteps. The logs are quite unhelpful:
[root@aeolus logs]# more 2047.aeolus.OU
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is /mnt/pvfs2/patton/data/chem/aa1
exec directory is /mnt/pvfs2/patton/exec/chem/aa1
arch directory is /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...

Terminated
--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C). It is dangerous to interrupt mpirun
while it is killing a job (proper termination may not be guaranteed).
Hit control-C again within 1 second if you really want to kill mpirun
immediately.
--------------------------------------------------------------------------
[compute-0-0.local:03444] OOB: Connection to HNP lost

We've been trying to figure out what's going on. We've tried different datasets, different nodes, and different numbers of processors. We started on OpenMPI 1.2.4 and upgraded to 1.2.6, with no change. We've connected the compute node to the head node directly (bypassing the switch, etc.), with no change. It doesn't matter where the data is stored. If we run with nodes=1 (single threaded, single CPU), it runs through to completion.

The only clue we've found turned up this morning: if we run the job directly with mpirun (so Torque has no knowledge of it), it runs fine. But submit it through Torque+Maui, and it dies as above.

I'm at a loss at this point as to how to troubleshoot this further. Is there a way to get more detail out of Torque about this? Turn up logging? Any known issues that might affect this? I have about a dozen users running on the cluster, all using the scheduler, about half of them running MPI jobs (some using nearly the entire cluster on a run), all without any such problems.

Any suggestions?

--Jim
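
P.S. On the logging question, the only knobs I've found so far are the ones below. This is just what I'm planning to try; the /opt/torque path and the init script name are guesses for my install, and the log_events value may need adjusting, so please tell me if there's a better way:

    # on the head node: turn on all Torque server event classes
    qmgr -c "set server log_events = 511"

    # on each compute node: raise the MOM log level, then restart pbs_mom
    # (adjust the config path / restart command to wherever Torque lives)
    echo '$loglevel 7' >> /opt/torque/mom_priv/config
    /etc/init.d/pbs_mom restart

    # after the job dies, pull together everything Torque logged about it
    # (2047 is the job id from the log above)
    tracejob -n 3 2047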