Yeah, the system admin is me, lol... and this is a new system whose bugs I
am frantically trying to work out. Torque and MPI are my last hurdles to
overcome, but I have already been through some faulty InfiniBand equipment,
bad memory, and bad drives, which is to be expected on a cluster.

I wish there were some kind of TM test tool; that would make this much
easier to debug. I will ping the Torque list again; originally they
forwarded me to the Open MPI list.
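
In the meantime, here is a rough smoke test I have sketched out for myself.
It is only a sketch, not a real TM test tool: it just drives pbsdsh, which
(per Ralph and Jeff below) exercises the same TM interface mpiexec uses. It
assumes it is run from inside an interactive allocation, it hardcodes my
pbsdsh path, and step 3 assumes a login shell on a compute node actually
sources my module setup, which may not be true on a stateless image:

#!/bin/bash
# Crude TM smoke test - run from inside an interactive job, e.g.
#   qsub -I -lnodes=2:ppn=12
# It only drives pbsdsh, which uses the same TM interface mpiexec does, so
# if these tasks launch and report a sane environment, TM itself is OK.
set -u

for node in $(sort -u "$PBS_NODEFILE"); do
    echo "=== $node ==="

    # 1. Can TM spawn anything at all on this node?
    /usr/local/bin/pbsdsh -h "$node" /bin/hostname || echo "TM spawn FAILED on $node"

    # 2. What LD_LIBRARY_PATH does a TM-spawned task actually see?
    #    (On my nodes this comes back unset.)
    /usr/local/bin/pbsdsh -h "$node" /bin/sh -c \
        'echo "plain task:  LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"'

    # 3. Does a login shell fix it? If so, the module setup in the shell
    #    startup files is what the TM-launched Open MPI daemon never gets.
    /usr/local/bin/pbsdsh -h "$node" /bin/bash -l -c \
        'echo "login shell: LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"'
done

If step 3 shows the Intel and Open MPI library paths while step 2 does not,
then I would guess the fix is on my side: make those directories visible to
TM-spawned processes (compute-node ld.so.conf, or the startup files that
non-interactive shells read) rather than anything in Open MPI itself. I
don't think mpirun's --prefix option would cover it either, since as far as
I know that only prepends Open MPI's own lib directory, not the Intel
runtime directory that libimf.so lives in.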

On Mon, Mar 21, 2011 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
> mpiexec doesn't use pbsdsh (we use the TM API), but the effect is the same.
> It's been so long since I ran on a Torque machine, though, that I honestly
> don't remember how to set the LD_LIBRARY_PATH on the backend.
>
> Do you have a sys admin there whom you could ask? Or you could ping the
> Torque list about it - pretty standard issue.
>
>
> On Mar 21, 2011, at 1:19 PM, Randall Svancara wrote:
>
>> Hi. The pbsdsh tool is great. I ran an interactive qsub session
>> (qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh -h node164 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=0
>> PBS_TASKNUM=146
>> PBS_MOMPORT=15003
>> PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=0
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh -h node163 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=1
>> PBS_TASKNUM=147
>> PBS_MOMPORT=15003
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=12
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>>
>> So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
>> appear to be available. I attempted to run mpiexec like this, and it fails:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh -h node163
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 12 exit status 127
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh -h node164
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 0 exit status 127
>>
>> If this is how the Open MPI processes are being launched, then it is no
>> wonder they are failing, and the LD_LIBRARY_PATH error message is
>> indeed somewhat accurate.
>>
>> So the next question is: how do I ensure that this information is
>> available to pbsdsh?
>>
>> Thanks,
>>
>> Randall
>>
>>
>> On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara <rsvanc...@gmail.com> wrote:
>>> Ok, these are good things to check. I am going to follow through with
>>> this in the next hour after our GPFS upgrade. Thanks!!!
>>>
>>> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen <bro...@umich.edu> wrote:
>>>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>>>>
>>>>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty
>>>>> -- but I think there's a Torque command to launch on remote nodes. tmrsh
>>>>> or pbsrsh or something like that...?
>>>>
>>>> pbsdsh
>>>> If TM is working, pbsdsh should work fine.
>>>>
>>>> Torque+OpenMPI has been working just fine for us.
>>>> Do you have libtorque on all your compute hosts? You should see it open
>>>> on all hosts if it works.
>>>>
>>>>>
>>>>> Try that and make sure it works. Open MPI should be using the same API
>>>>> as that command under the covers.
>>>>>
>>>>> I also have a dim recollection that the TM API support library(ies?) may
>>>>> not be installed by default. You may have to ensure that they're
>>>>> available on all nodes...?
>>>>>
>>>>>
>>>>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>>>>
>>>>>> I am not sure if there is any extra configuration necessary for Torque
>>>>>> to forward the environment. I have included the output of printenv
>>>>>> for an interactive qsub session. I am really at a loss here because I
>>>>>> have never had this much difficulty making Torque run with Open MPI.
>>>>>> It has been mostly a good experience.
>>>>>>
>>>>>> Permissions of /tmp:
>>>>>>
>>>>>> drwxrwxrwt 4 root root 140 Mar 20 08:57 tmp
>>>>>>
>>>>>> mpiexec hostname, single node:
>>>>>>
>>>>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>>>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>>>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>>>>
>>>>>> [rsvancara@node100 ~]$ mpiexec hostname
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>>
>>>>>> mpiexec hostname, two nodes:
>>>>>>
>>>>>> [rsvancara@node100 ~]$ mpiexec hostname
>>>>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>>>>> status = 17002
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>> --------------------------------------------------------------------------
>>>>>> node99 - daemon did not report back when launched
>>>>>>
>>>>>>
>>>>>> MPIexec on one node with one cpu:
>>>>>>
>>>>>> [rsvancara@node164 ~]$ mpiexec printenv
>>>>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>>>>> MODULE_VERSION_STACK=3.2.8
>>>>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>>>>> HOSTNAME=node164
>>>>>> PBS_VERSION=TORQUE-2.4.7
>>>>>> TERM=xterm
>>>>>> SHELL=/bin/bash
>>>>>> HISTSIZE=1000
>>>>>> PBS_JOBNAME=STDIN
>>>>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>>>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>>>>> PBS_TASKNUM=1
>>>>>> USER=rsvancara
>>>>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>>>>>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>>>>>> PBS_O_HOME=/home/admins/rsvancara
>>>>>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
>>>>>> PBS_MOMPORT=15003
>>>>>> PBS_O_QUEUE=batch
>>>>>> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
>>>>>> MODULE_VERSION=3.2.8
>>>>>> MAIL=/var/spool/mail/rsvancara
>>>>>> PBS_O_LOGNAME=rsvancara
>>>>>> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>>>>>> PBS_O_LANG=en_US.UTF-8
>>>>>> PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
>>>>>> INPUTRC=/etc/inputrc
>>>>>> PWD=/home/admins/rsvancara
>>>>>> _LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
>>>>>> PBS_NODENUM=0
>>>>>> LANG=C
>>>>>> MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
>>>>>> LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
>>>>>> PBS_O_SHELL=/bin/bash
>>>>>> PBS_SERVER=mgt1.wsuhpc.edu
>>>>>> PBS_JOBID=1670.mgt1.wsuhpc.edu
>>>>>> SHLVL=1
>>>>>> HOME=/home/admins/rsvancara
>>>>>> INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
>>>>>> PBS_O_HOST=login1
>>>>>> DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
>>>>>> PBS_VNODENUM=0
>>>>>> LOGNAME=rsvancara
>>>>>> PBS_QUEUE=batch
>>>>>> MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
>>>>>> LESSOPEN=|/usr/bin/lesspipe.sh %s
>>>>>> PBS_O_MAIL=/var/spool/mail/rsvancara
>>>>>> G_BROKEN_FILENAMES=1
>>>>>> PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
>>>>>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>>>>>> module=() { eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd
>>>>>> bash $*`
>>>>>> }
>>>>>> _=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
>>>>>> OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>>>>>> OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>>>>>> OMPI_MCA_mpi_yield_when_idle=0
>>>>>> OMPI_MCA_orte_app_num=0
>>>>>> OMPI_UNIVERSE_SIZE=1
>>>>>> OMPI_MCA_ess=env
>>>>>> OMPI_MCA_orte_ess_num_procs=1
>>>>>> OMPI_COMM_WORLD_SIZE=1
>>>>>> OMPI_COMM_WORLD_LOCAL_SIZE=1
>>>>>> OMPI_MCA_orte_ess_jobid=3236233217
>>>>>> OMPI_MCA_orte_ess_vpid=0
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> OMPI_COMM_WORLD_LOCAL_RANK=0
>>>>>> OPAL_OUTPUT_STDERR_FD=19
>>>>>>
>>>>>> mpiexec with -mca plm rsh:
>>>>>>
>>>>>> [rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base
>>>>>> /fastscratch/admins/tmp hostname
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Can you run anything under TM? Try running "hostname" directly from
>>>>>>> Torque to see if anything works at all.
>>>>>>>
>>>>>>> The error message is telling you that the Torque daemon on the remote
>>>>>>> node reported a failure when trying to launch the OMPI daemon. Could be
>>>>>>> that Torque isn't set up to forward environments, so the OMPI daemon
>>>>>>> isn't finding required libs. You could directly run "printenv" to see
>>>>>>> how your remote environment is being set up.
>>>>>>>
>>>>>>> Could be that the tmp dir lacks correct permissions for a user to
>>>>>>> create the required directories.
>>>>>>> The OMPI daemon tries to create a session directory in the tmp dir, so
>>>>>>> failure to do so would indeed cause the launch to fail. You can specify
>>>>>>> the tmp dir with a cmd line option to mpirun. See "mpirun -h" for info.
>>>>>>>
>>>>>>>
>>>>>>> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>>>>>>>
>>>>>>>> I have a question about using Open MPI and Torque on stateless nodes.
>>>>>>>> I have compiled Open MPI 1.4.3 with --with-tm=/usr/local
>>>>>>>> --without-slurm, using Intel compiler version 11.1.075.
>>>>>>>>
>>>>>>>> When I run a simple "hello world" MPI program, I am receiving the
>>>>>>>> following error:
>>>>>>>>
>>>>>>>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
>>>>>>>> status = 17002
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>>>> launch so we are aborting.
>>>>>>>>
>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>>>>> that caused that situation.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> node163 - daemon did not report back when launched
>>>>>>>> node159 - daemon did not report back when launched
>>>>>>>> node158 - daemon did not report back when launched
>>>>>>>> node157 - daemon did not report back when launched
>>>>>>>> node156 - daemon did not report back when launched
>>>>>>>> node155 - daemon did not report back when launched
>>>>>>>> node154 - daemon did not report back when launched
>>>>>>>> node152 - daemon did not report back when launched
>>>>>>>> node151 - daemon did not report back when launched
>>>>>>>> node150 - daemon did not report back when launched
>>>>>>>> node149 - daemon did not report back when launched
>>>>>>>>
>>>>>>>> But if I include:
>>>>>>>>
>>>>>>>> -mca plm rsh
>>>>>>>>
>>>>>>>> the job runs just fine.
>>>>>>>>
>>>>>>>> I am not sure what the problem is with Torque or Open MPI that prevents
>>>>>>>> the processes from launching on remote nodes. I have posted to the
>>>>>>>> Torque list, and someone suggested that limited temporary directory
>>>>>>>> space may be causing the issue. I have 100MB allocated to /tmp.
>>>>>>>>
>>>>>>>> Any ideas as to why I am having this problem would be appreciated.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Randall Svancara
>>>>>>>> http://knowyourlinux.com/


--
Randall Svancara
http://knowyourlinux.com/
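
P.S. Two more checks I plan to run, prompted by Brock's libtorque question
and Jeff's note that the TM support libraries may not be installed
everywhere. These are only sketches against my own paths (Torque under
/usr/local, Open MPI 1.4.3 under /home/software/mpi/intel/openmpi-1.4.3),
and the plugin location is my assumption about where Open MPI 1.4 keeps its
MCA components:

# Was TM support actually built into this Open MPI install?
ompi_info | grep -i ' tm '

# Can a TM-spawned task on a compute node resolve everything the TM launch
# component needs (libtorque, and in my case the Intel runtime as well)?
/usr/local/bin/pbsdsh -h node163 /usr/bin/ldd \
    /home/software/mpi/intel/openmpi-1.4.3/lib/openmpi/mca_plm_tm.so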