I have a question about using Open MPI with Torque on stateless nodes.
I compiled Open MPI 1.4.3 with --with-tm=/usr/local --without-slurm,
using the Intel compiler version 11.1.075.
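
To check that the build actually picked up Torque support, the component
list printed by ompi_info can be filtered for the tm components. This is
only a sketch: it assumes the ompi_info from this build is first in PATH,
and the helper name has_tm is made up for illustration.

```shell
#!/bin/sh
# Pass through only the ompi_info lines that list a tm component for the
# plm (launcher) or ras (allocator) framework. has_tm is a hypothetical
# helper name; it reads ompi_info output on stdin.
has_tm() {
    grep -E 'MCA (plm|ras): tm'
}

# Only query the installation when ompi_info is actually available.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | has_tm || echo "no tm components found in this build"
fi
```

If no tm lines show up, the configure option probably did not take effect
and the launch problem would be in the build rather than in Torque.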

When I run a simple "hello world" MPI program, I get the following
error:

[node164:11193] plm:tm: failed to poll for a spawned daemon, return status = 17002
 --------------------------------------------------------------------------
 A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
 launch so we are aborting.

 There may be more information reported by the environment (see above).

 This may be because the daemon was unable to find all the needed shared
 libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
 location of the shared libraries on the remote nodes and this will
 automatically be forwarded to the remote nodes.
 --------------------------------------------------------------------------
 --------------------------------------------------------------------------
 mpiexec noticed that the job aborted, but has no info as to the process
 that caused that situation.
 --------------------------------------------------------------------------
 --------------------------------------------------------------------------
 mpiexec was unable to cleanly terminate the daemons on the nodes shown
 below. Additional manual cleanup may be required - please refer to
 the "orte-clean" tool for assistance.
 --------------------------------------------------------------------------
         node163 - daemon did not report back when launched
         node159 - daemon did not report back when launched
         node158 - daemon did not report back when launched
         node157 - daemon did not report back when launched
         node156 - daemon did not report back when launched
         node155 - daemon did not report back when launched
         node154 - daemon did not report back when launched
         node152 - daemon did not report back when launched
         node151 - daemon did not report back when launched
         node150 - daemon did not report back when launched
         node149 - daemon did not report back when launched


However, if I add the following option to the mpiexec command line,
the job runs just fine:

-mca plm rsh
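
For reference, the two runs differ only in the launcher selection. The
sketch below shows both side by side. The executable ./hello and the
process count of 4 are placeholders rather than my actual job,
launch_args is a made-up helper name, and plm_base_verbose (with an
arbitrary level of 5) is added only to get more detail from the launcher.

```shell
#!/bin/sh
# Print the MCA options that select a launcher ("tm" or "rsh") and raise
# the launcher's verbosity. launch_args is a hypothetical helper name.
launch_args() {
    echo "-mca plm $1 -mca plm_base_verbose 5"
}

# Run both variants only inside a Torque job with mpiexec available.
# ./hello and -np 4 are placeholders for the real program and job size.
if command -v mpiexec >/dev/null 2>&1 && [ -n "${PBS_JOBID:-}" ]; then
    for launcher in tm rsh; do
        echo "== launcher: $launcher =="
        # Unquoted on purpose so the options split into separate words.
        mpiexec $(launch_args "$launcher") -np 4 ./hello
    done
fi
```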

I am not sure what problem in Torque or Open MPI prevents the
processes from launching on the remote nodes.  I posted to the Torque
list, and someone suggested that a lack of temporary directory space
may be causing the issue.  I have 100MB allocated to /tmp.
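
One way to test the temporary-space theory would be to look at the free
space under /tmp on each node of the job. This is a rough sketch, not
something I have verified on these nodes: free_kb is a made-up helper
name, and it assumes pbsdsh from Torque is installed and that df lives in
/bin on the compute nodes. Since pbsdsh starts its command through the
same TM interface, a failure there would point at Torque rather than at
Open MPI.

```shell
#!/bin/sh
# Print the space available (in KB) on the filesystem holding a directory.
# free_kb is a hypothetical helper name; -P forces the POSIX one-line
# format, where column 4 is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

tmpdir="${TMPDIR:-/tmp}"
echo "local $tmpdir: $(free_kb "$tmpdir") KB free"

# Inside a Torque job, run df once per unique node (-u) via pbsdsh.
if command -v pbsdsh >/dev/null 2>&1 && [ -n "${PBS_JOBID:-}" ]; then
    pbsdsh -u /bin/df -Pk /tmp
fi
```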

Any ideas as to why I am having this problem would be appreciated.


-- 
Randall Svancara
http://knowyourlinux.com/
