On Mar 21, 2011, at 11:53 AM, Randall Svancara wrote:

> I am not sure if there is any extra configuration necessary for torque
> to forward the environment. I have included the output of printenv
> for an interactive qsub session. I am really at a loss here because I
> never had this much difficulty making torque run with openmpi. It has
> been mostly a good experience.

I'm not seeing this problem reported by other Torque users, so it appears to be something in your local setup. Note that running mpiexec on a single node doesn't invoke Torque at all - mpiexec just fork/execs the app processes directly. Torque is only invoked when running on multiple nodes.

One thing stands out immediately. When you used rsh, you specified the tmp dir:

> -mca orte_tmpdir_base /fastscratch/admins/tmp

Yet you didn't do so when running under Torque. Was there a reason?
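
If not, it would be worth re-running the two-node case under TM with that same tmp dir override, and checking what the Torque-spawned environment looks like on the remote node. A rough sketch (untested here; it assumes Torque's pbsdsh utility is installed on the compute nodes and that the plm_base_verbose parameter is available in your 1.4.3 build):

    # from inside a two-node interactive job, e.g. qsub -I -lnodes=2:ppn=12
    pbsdsh hostname
        # can Torque itself start a process on every allocated node?
    pbsdsh -u printenv | grep -E 'LD_LIBRARY_PATH|PATH|TMPDIR'
        # environment seen by TM-spawned processes on each node
    mpiexec -mca orte_tmpdir_base /fastscratch/admins/tmp hostname
        # TM launch, but with the same tmp dir you used under rsh
    mpiexec -mca plm_base_verbose 5 hostname
        # extra detail on where the TM launch fails

If the pbsdsh printenv output is missing the Open MPI lib directory from LD_LIBRARY_PATH, or if the orte_tmpdir_base override makes the TM launch succeed, that would point at environment forwarding or the /tmp space, respectively.
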
>
> Permissions of /tmp
>
> drwxrwxrwt 4 root root 140 Mar 20 08:57 tmp
>
> mpiexec hostname single node:
>
> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
> qsub: job 1667.mgt1.wsuhpc.edu ready
>
> [rsvancara@node100 ~]$ mpiexec hostname
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
>
> mpiexec hostname two nodes:
>
> [rsvancara@node100 ~]$ mpiexec hostname
> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
> status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> node99 - daemon did not report back when launched
>
> MPIexec on one node with one cpu:
>
> [rsvancara@node164 ~]$ mpiexec printenv
> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
> MODULE_VERSION_STACK=3.2.8
> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
> HOSTNAME=node164
> PBS_VERSION=TORQUE-2.4.7
> TERM=xterm
> SHELL=/bin/bash
> HISTSIZE=1000
> PBS_JOBNAME=STDIN
> PBS_ENVIRONMENT=PBS_INTERACTIVE
> PBS_O_WORKDIR=/home/admins/rsvancara
> PBS_TASKNUM=1
> USER=rsvancara
> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
> PBS_O_HOME=/home/admins/rsvancara
> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
> PBS_MOMPORT=15003
> PBS_O_QUEUE=batch
> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
> MODULE_VERSION=3.2.8
> MAIL=/var/spool/mail/rsvancara
> PBS_O_LOGNAME=rsvancara
> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_LANG=en_US.UTF-8
> PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
> INPUTRC=/etc/inputrc
> PWD=/home/admins/rsvancara
> _LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
> PBS_NODENUM=0
> LANG=C
> MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
> LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
> PBS_O_SHELL=/bin/bash
> PBS_SERVER=mgt1.wsuhpc.edu
> PBS_JOBID=1670.mgt1.wsuhpc.edu
> SHLVL=1
> HOME=/home/admins/rsvancara
> INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
> PBS_O_HOST=login1
> DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
> PBS_VNODENUM=0
> LOGNAME=rsvancara
> PBS_QUEUE=batch
> MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
> LESSOPEN=|/usr/bin/lesspipe.sh %s
> PBS_O_MAIL=/var/spool/mail/rsvancara
> G_BROKEN_FILENAMES=1
> PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> module=() { eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
> }
> _=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
> OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
> OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
> OMPI_MCA_mpi_yield_when_idle=0
> OMPI_MCA_orte_app_num=0
> OMPI_UNIVERSE_SIZE=1
> OMPI_MCA_ess=env
> OMPI_MCA_orte_ess_num_procs=1
> OMPI_COMM_WORLD_SIZE=1
> OMPI_COMM_WORLD_LOCAL_SIZE=1
> OMPI_MCA_orte_ess_jobid=3236233217
> OMPI_MCA_orte_ess_vpid=0
> OMPI_COMM_WORLD_RANK=0
> OMPI_COMM_WORLD_LOCAL_RANK=0
> OPAL_OUTPUT_STDERR_FD=19
>
> MPIExec with -mca plm rsh:
>
> [rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base /fastscratch/admins/tmp hostname
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
>
>
> On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Can you run anything under TM? Try running "hostname" directly from Torque
>> to see if anything works at all.
>>
>> The error message is telling you that the Torque daemon on the remote node
>> reported a failure when trying to launch the OMPI daemon. Could be that
>> Torque isn't setup to forward environments so the OMPI daemon isn't finding
>> required libs. You could directly run "printenv" to see how your remote
>> environ is being setup.
>>
>> Could be that the tmp dir lacks correct permissions for a user to create the
>> required directories. The OMPI daemon tries to create a session directory in
>> the tmp dir, so failure to do so would indeed cause the launch to fail. You
>> can specify the tmp dir with a cmd line option to mpirun. See "mpirun -h"
>> for info.
>>
>>
>> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>>
>>> I have a question about using OpenMPI and Torque on stateless nodes.
>>> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
>>> --without-slurm using intel compiler version 11.1.075.
>>>
>>> When I run a simple "hello world" mpi program, I am receiving the
>>> following error.
>>>
>>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
>>> status = 17002
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --------------------------------------------------------------------------
>>> node163 - daemon did not report back when launched
>>> node159 - daemon did not report back when launched
>>> node158 - daemon did not report back when launched
>>> node157 - daemon did not report back when launched
>>> node156 - daemon did not report back when launched
>>> node155 - daemon did not report back when launched
>>> node154 - daemon did not report back when launched
>>> node152 - daemon did not report back when launched
>>> node151 - daemon did not report back when launched
>>> node150 - daemon did not report back when launched
>>> node149 - daemon did not report back when launched
>>>
>>>
>>> But if I include:
>>>
>>> -mca plm rsh
>>>
>>> The job runs just fine.
>>>
>>> I am not sure what the problem is with torque or openmpi that prevents
>>> the process from launching on remote nodes. I have posted to the
>>> torque list and someone suggested that it may be temporary directory
>>> space that can be causing issues. I have 100MB allocated to /tmp
>>>
>>> Any ideas as to why I am having this problem would be appreciated.
>>>
>>>
>>> --
>>> Randall Svancara
>>> http://knowyourlinux.com/
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Randall Svancara
> http://knowyourlinux.com/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
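
One additional check worth running on a compute node, given the --with-tm build mentioned above (a sketch; the exact ompi_info output wording may vary by version): confirm that the TM launcher and allocator components were actually built into the Open MPI install visible on the compute nodes.

    ompi_info | grep ": tm"
        # if TM support was built in, this should list the tm components,
        # e.g. lines like "MCA ras: tm (...)" and "MCA plm: tm (...)"

The plm:tm error in the thread shows the TM launcher is at least being selected, so this mainly serves to confirm that the install on the compute nodes matches the one used to submit the job.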