I have a question about using Open MPI and Torque on stateless nodes. I compiled Open MPI 1.4.3 with --with-tm=/usr/local --without-slurm using the Intel compiler, version 11.1.075.
When I run a simple "hello world" MPI program, I receive the following error:

[node164:11193] plm:tm: failed to poll for a spawned daemon, return status = 17002
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        node163 - daemon did not report back when launched
        node159 - daemon did not report back when launched
        node158 - daemon did not report back when launched
        node157 - daemon did not report back when launched
        node156 - daemon did not report back when launched
        node155 - daemon did not report back when launched
        node154 - daemon did not report back when launched
        node152 - daemon did not report back when launched
        node151 - daemon did not report back when launched
        node150 - daemon did not report back when launched
        node149 - daemon did not report back when launched

But if I include:

-mca plm rsh

the job runs just fine.
I am not sure what problem with Torque or Open MPI prevents the processes from launching on the remote nodes. I posted to the Torque list, and someone suggested that limited temporary directory space may be causing the issue. I have 100 MB allocated to /tmp.

Any ideas as to why I am having this problem would be appreciated.

-- 
Randall Svancara
http://knowyourlinux.com/