Yeah, the system admin is me, lol... and this is a new system whose bugs I
am frantically trying to work out.  Torque and MPI are my last hurdles to
overcome, but I have already been through some faulty InfiniBand equipment,
bad memory, and bad drives... which is to be expected on a cluster.


I wish there were some kind of TM test tool; that would make problems like
this much easier to diagnose.
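
In the meantime, pbsdsh is probably the closest thing, since it exercises
the same TM calls that Open MPI uses to spawn its daemons.  A crude sweep
over every host in a job could serve as a poor man's TM test (a sketch,
untested; run it from inside a qsub -I session):

for h in $(sort -u "$PBS_NODEFILE"); do
    echo "== $h =="
    # a successful TM launch of /bin/true suggests the MOM on $h is reachable
    pbsdsh -h "$h" /bin/true && echo "TM launch OK" || echo "TM launch FAILED"
done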

I will ping the Torque list again.  Originally they forwarded me to
the Open MPI list.

On Mon, Mar 21, 2011 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
> mpiexec doesn't use pbsdsh (we use the TM API), but the effect is the same.
> Been so long since I ran on a Torque machine, though, that I honestly don't
> remember how to set the LD_LIBRARY_PATH on the backend.
>
> Do you have a sys admin there whom you could ask? Or you could ping the 
> Torque list about it - pretty standard issue.
>
>
> On Mar 21, 2011, at 1:19 PM, Randall Svancara wrote:
>
>> Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
>> (qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=0
>> PBS_TASKNUM=146
>> PBS_MOMPORT=15003
>> PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=0
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=1
>> PBS_TASKNUM=147
>> PBS_MOMPORT=15003
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=12
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>>
>> So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
>> appear to be available.  I attempted to run mpiexec like this and it failed:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 12 exit status 127
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 0 exit status 127
>>
>> If this is how the Open MPI processes are being launched, then it is no
>> wonder they are failing, and the LD_LIBRARY_PATH error message is
>> indeed accurate.
>>
>> So the next question is: how do I ensure that this information is
>> available to pbsdsh?
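>>
>> One idea (untested) is to set the variable explicitly for the spawned
>> task, or to add it to the MOM's default task environment in
>> /var/spool/torque/pbs_environment on every compute node (I believe that
>> file is where the bare PATH=/bin:/usr/bin and LANG=C above come from):
>>
>> # per-invocation workaround: wrap the real command with env
>> pbsdsh -h node163 /usr/bin/env \
>>     LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64 \
>>     /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>>
>> # cluster-wide: append to the MOM environment file on each node,
>> # then restart pbs_mom
>> echo 'LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64' >> /var/spool/torque/pbs_environment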
>>
>> Thanks,
>>
>> Randall
>>
>>
>> On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara <rsvanc...@gmail.com> 
>> wrote:
>>> Ok, these are good things to check.  I am going to follow through with
>>> this in the next hour after our GPFS upgrade.  Thanks!!!
>>>
>>> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen <bro...@umich.edu> wrote:
>>>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>>>>
>>>>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty 
>>>>> -- but I think there's a Torque command to launch on remote nodes.  tmrsh 
>>>>> or pbsrsh or something like that...?
>>>>
>>>> pbsdsh
>>>> If TM is working, pbsdsh should work fine.
>>>>
>>>> Torque+OpenMPI has been working just fine for us.
>>>> Do you have libtorque on all your compute hosts?  You should see it opened
>>>> on all hosts if TM is actually being used.
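>>>>
>>>> A quick way to check might be (library path assumed from the
>>>> --with-tm=/usr/local build mentioned below; adjust to wherever your
>>>> Torque actually installed it):
>>>>
>>>> pbsdsh /sbin/ldconfig -p | grep -i libtorque
>>>> pbsdsh ls -l /usr/local/lib/libtorque.so*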
>>>>
>>>>>
>>>>> Try that and make sure it works.  Open MPI should be using the same API 
>>>>> as that command under the covers.
>>>>>
>>>>> I also have a dim recollection that the TM API support library(ies?) may 
>>>>> not be installed by default.  You may have to ensure that they're 
>>>>> available on all nodes...?
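>>>>>
>>>>> You can also confirm that your Open MPI build picked up the TM
>>>>> components; a TM-enabled build should list tm entries for the plm and
>>>>> ras frameworks:
>>>>>
>>>>> ompi_info | grep tm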
>>>>>
>>>>>
>>>>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>>>>
>>>>>> I am not sure whether any extra configuration is necessary for Torque
>>>>>> to forward the environment.  I have included the output of printenv
>>>>>> for an interactive qsub session.  I am really at a loss here because I
>>>>>> have never had this much difficulty making Torque run with Open MPI.  It
>>>>>> has been mostly a good experience.
>>>>>>
>>>>>> Permissions of /tmp:
>>>>>>
>>>>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>>>>
>>>>>> mpiexec hostname on a single node:
>>>>>>
>>>>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>>>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>>>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>>>>
>>>>>> [rsvancara@node100 ~]$ mpiexec hostname
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>> node100
>>>>>>
>>>>>> mpiexec hostname on two nodes:
>>>>>>
>>>>>> [rsvancara@node100 ~]$ mpiexec hostname
>>>>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>>>>> status = 17002
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have 
>>>>>> the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>> --------------------------------------------------------------------------
>>>>>>      node99 - daemon did not report back when launched
>>>>>>
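>>>>>> If more detail would help, I can rerun with the launcher's verbosity
>>>>>> turned up and post the output:
>>>>>>
>>>>>> mpiexec -mca plm_base_verbose 10 hostname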
>>>>>>
>>>>>> mpiexec on one node with one CPU:
>>>>>>
>>>>>> [rsvancara@node164 ~]$ mpiexec printenv
>>>>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>>>>> MODULE_VERSION_STACK=3.2.8
>>>>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>>>>> HOSTNAME=node164
>>>>>> PBS_VERSION=TORQUE-2.4.7
>>>>>> TERM=xterm
>>>>>> SHELL=/bin/bash
>>>>>> HISTSIZE=1000
>>>>>> PBS_JOBNAME=STDIN
>>>>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>>>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>>>>> PBS_TASKNUM=1
>>>>>> USER=rsvancara
>>>>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>>>>>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>>>>>> PBS_O_HOME=/home/admins/rsvancara
>>>>>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
>>>>>> PBS_MOMPORT=15003
>>>>>> PBS_O_QUEUE=batch
>>>>>> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
>>>>>> MODULE_VERSION=3.2.8
>>>>>> MAIL=/var/spool/mail/rsvancara
>>>>>> PBS_O_LOGNAME=rsvancara
>>>>>> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>>>>>> PBS_O_LANG=en_US.UTF-8
>>>>>> PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
>>>>>> INPUTRC=/etc/inputrc
>>>>>> PWD=/home/admins/rsvancara
>>>>>> _LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
>>>>>> PBS_NODENUM=0
>>>>>> LANG=C
>>>>>> MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
>>>>>> LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
>>>>>> PBS_O_SHELL=/bin/bash
>>>>>> PBS_SERVER=mgt1.wsuhpc.edu
>>>>>> PBS_JOBID=1670.mgt1.wsuhpc.edu
>>>>>> SHLVL=1
>>>>>> HOME=/home/admins/rsvancara
>>>>>> INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
>>>>>> PBS_O_HOST=login1
>>>>>> DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
>>>>>> PBS_VNODENUM=0
>>>>>> LOGNAME=rsvancara
>>>>>> PBS_QUEUE=batch
>>>>>> MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
>>>>>> LESSOPEN=|/usr/bin/lesspipe.sh %s
>>>>>> PBS_O_MAIL=/var/spool/mail/rsvancara
>>>>>> G_BROKEN_FILENAMES=1
>>>>>> PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
>>>>>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>>>>>> module=() {  eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd 
>>>>>> bash $*`
>>>>>> }
>>>>>> _=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
>>>>>> OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>>>>>> OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>>>>>> OMPI_MCA_mpi_yield_when_idle=0
>>>>>> OMPI_MCA_orte_app_num=0
>>>>>> OMPI_UNIVERSE_SIZE=1
>>>>>> OMPI_MCA_ess=env
>>>>>> OMPI_MCA_orte_ess_num_procs=1
>>>>>> OMPI_COMM_WORLD_SIZE=1
>>>>>> OMPI_COMM_WORLD_LOCAL_SIZE=1
>>>>>> OMPI_MCA_orte_ess_jobid=3236233217
>>>>>> OMPI_MCA_orte_ess_vpid=0
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> OMPI_COMM_WORLD_LOCAL_RANK=0
>>>>>> OPAL_OUTPUT_STDERR_FD=19
>>>>>>
>>>>>> mpiexec with -mca plm rsh:
>>>>>>
>>>>>> [rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base
>>>>>> /fastscratch/admins/tmp hostname
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node164
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>> node163
>>>>>>
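>>>>>> As a stopgap while I sort out TM, I assume I could also make the rsh
>>>>>> launcher the default through Open MPI's per-user MCA parameter file:
>>>>>>
>>>>>> mkdir -p ~/.openmpi
>>>>>> echo "plm = rsh" >> ~/.openmpi/mca-params.conf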
>>>>>>
>>>>>> On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Can you run anything under TM? Try running "hostname" directly from
>>>>>>> Torque to see if anything works at all.
>>>>>>>
>>>>>>> The error message is telling you that the Torque daemon on the remote
>>>>>>> node reported a failure when trying to launch the OMPI daemon. It could be
>>>>>>> that Torque isn't set up to forward environments, so the OMPI daemon
>>>>>>> isn't finding the required libs. You could directly run "printenv" to see
>>>>>>> how your remote environment is being set up.
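>>>>>>>
>>>>>>> For example, from inside an interactive allocation (if memory serves,
>>>>>>> pbsdsh goes through the same TM interface that we use):
>>>>>>>
>>>>>>> qsub -I -l nodes=2:ppn=1
>>>>>>> pbsdsh hostname          # one line per allocated vnode if TM is healthy
>>>>>>> pbsdsh printenv | sort   # see the environment TM hands to tasks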
>>>>>>>
>>>>>>> It could also be that the tmp dir lacks the correct permissions for a
>>>>>>> user to create the required directories. The OMPI daemon tries to create
>>>>>>> a session directory in the tmp dir, so failure to do so would indeed
>>>>>>> cause the launch to fail. You can specify the tmp dir with a command-line
>>>>>>> option to mpirun. See "mpirun -h" for info.
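>>>>>>>
>>>>>>> A quick sanity check might be to look at /tmp on a compute node and,
>>>>>>> if needed, point the session directory somewhere roomier (the scratch
>>>>>>> path here is just an example):
>>>>>>>
>>>>>>> ls -ld /tmp && df -h /tmp
>>>>>>> mpiexec -mca orte_tmpdir_base /fastscratch/admins/tmp hostname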
>>>>>>>
>>>>>>>
>>>>>>> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>>>>>>>
>>>>>>>> I have a question about using OpenMPI and Torque on stateless nodes.
>>>>>>>> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
>>>>>>>> --without-slurm using intel compiler version 11.1.075.
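>>>>>>>>
>>>>>>>> For reference, the build was configured essentially like this (the
>>>>>>>> compiler variables are reconstructed from memory, so treat them as
>>>>>>>> approximate):
>>>>>>>>
>>>>>>>> ./configure --with-tm=/usr/local --without-slurm \
>>>>>>>>     CC=icc CXX=icpc F77=ifort FC=ifort
>>>>>>>> make all install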
>>>>>>>>
>>>>>>>> When I run a simple "hello world" mpi program, I am receiving the
>>>>>>>> following error.
>>>>>>>>
>>>>>>>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
>>>>>>>> status = 17002
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting 
>>>>>>>> to
>>>>>>>> launch so we are aborting.
>>>>>>>>
>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have 
>>>>>>>> the
>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>>>>> that caused that situation.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>        node163 - daemon did not report back when launched
>>>>>>>>        node159 - daemon did not report back when launched
>>>>>>>>        node158 - daemon did not report back when launched
>>>>>>>>        node157 - daemon did not report back when launched
>>>>>>>>        node156 - daemon did not report back when launched
>>>>>>>>        node155 - daemon did not report back when launched
>>>>>>>>        node154 - daemon did not report back when launched
>>>>>>>>        node152 - daemon did not report back when launched
>>>>>>>>        node151 - daemon did not report back when launched
>>>>>>>>        node150 - daemon did not report back when launched
>>>>>>>>        node149 - daemon did not report back when launched
>>>>>>>>
>>>>>>>>
>>>>>>>> But if I include:
>>>>>>>>
>>>>>>>> -mca plm rsh
>>>>>>>>
>>>>>>>> The job runs just fine.
>>>>>>>>
>>>>>>>> I am not sure what the problem is with Torque or Open MPI that prevents
>>>>>>>> the processes from launching on remote nodes.  I have posted to the
>>>>>>>> Torque list, and someone suggested that limited temporary directory
>>>>>>>> space may be causing issues.  I have 100MB allocated to /tmp.
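>>>>>>>>
>>>>>>>> A quick check on a compute node (sketch):
>>>>>>>>
>>>>>>>> df -h /tmp       # confirm free space
>>>>>>>> ls -ld /tmp      # should be drwxrwxrwt (world-writable, sticky bit)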
>>>>>>>>
>>>>>>>> Any ideas as to why I am having this problem would be appreciated.
>>>>>>>>
>>>>>>>>

-- 
Randall Svancara
http://knowyourlinux.com/
