I have Open MPI running fine for a small to medium number of tasks (a simple
hello or cpi program). But when I try 700 or 800 tasks, it hangs,
apparently on startup. I think this might be related to LDAP, because if I
try to log into my account while the job is hung, I am told that my username
doesn't exist. However, when I added -debug to the mpirun command, I got the
same sequence of output as for successful smaller runs, until it hung
again. So I added --debug-daemons and got the following (this time mpirun
exited instead of hanging):

...
[blade1:31733] [0,0,0] wrote setup file
--------------------------------------------------------------------------
The rsh launcher has been given a number of 128 concurrent daemons to
launch and is in a debug-daemons option. However, the total number of
daemons to launch (200) is greater than this value. This is a scenario that
will cause the system to deadlock.

To avoid deadlock, either increase the number of concurrent daemons, or
remove the debug-daemons flag.
--------------------------------------------------------------------------
[blade1:31733] [0,0,0] ORTE_ERROR_LOG: Fatal in file ../../../../../orte/mca/rmgr/urm/rmgr_urm.c at line 455
[blade1:31733] mpirun: spawn failed with errno=-6
[blade1:31733] sess_dir_finalize: proc session dir not empty - leaving

 

Any ideas or suggestions appreciated.

 

Todd Heywood