Check out:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
In particular, see the sections about running under resource managers.
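In short: memlock settings in /etc/security/limits.conf only apply to PAM login sessions; processes started by Torque inherit their limits from the pbs_mom daemon, so the daemon itself has to be started with a raised limit. A rough sketch of the relevant settings follows; the init script path is an assumption and varies by distribution:

# /etc/security/limits.conf: covers interactive (PAM) logins only
* soft memlock unlimited
* hard memlock unlimited

# pbs_mom typically does not go through PAM, so raise the limit in its
# startup script before the daemon launches (path is an assumption,
# e.g. /etc/init.d/pbs_mom on many distributions):
ulimit -l unlimited

# Restart pbs_mom on every compute node so new jobs inherit the limit.
/etc/init.d/pbs_mom restart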
On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:
Ok, I am having some difficulty troubleshooting this.
If I run my hello program without torque, it works fine:
[root@login1 root]# mpirun --mca btl openib,self -host
n01,n02,n03,n04,n05 /data/root/hello
Hello from process 0 of 5 on node n01
Hello from process 1 of 5 on node n02
Hello from process 2 of 5 on node n03
Hello from process 3 of 5 on node n04
Hello from process 4 of 5 on node n05
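For comparison, the locked-memory limit each node actually grants to processes launched this way can be printed with something like the following sketch (it assumes bash is installed on the compute nodes):

mpirun -host n01,n02,n03,n04,n05 bash -c 'ulimit -l'

Each process prints one line; in a working interactive setup like the one above, all five should say "unlimited".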
If I submit it as root, it seems happy:
[root@login1 root]# qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=5:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N TestJob
#PBS -q long
#PBS -o output.txt
#PBS -V
cd $PBS_O_WORKDIR
rm -f output.txt
date
mpirun --mca btl openib,self /data/root/hello
103.cluster.default.domain
[root@login1 root]# cat output.txt
Wed Nov 7 16:20:33 PST 2007
Hello from process 0 of 5 on node n05
Hello from process 1 of 5 on node n04
Hello from process 2 of 5 on node n03
Hello from process 3 of 5 on node n02
Hello from process 4 of 5 on node n01
If I do it as me, not so good:
[andrus@login1 data]$ qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N TestJob
#PBS -q long
#PBS -o output.txt
#PBS -V
cd $PBS_O_WORKDIR
rm -f output.txt
date
mpirun --mca btl openib,self /data/root/hello
105.littlemac.default.domain
[andrus@login1 data]$ cat output.txt
Wed Nov 7 16:23:00 PST 2007
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to allocate some
locked memory. This typically can indicate that the memlock limits
are set too low. For most HPC installations, the memlock limits
should be set to "unlimited". The failure occured here:
Host: n01
OMPI source: btl_openib.c:828
Function: ibv_create_cq()
Device: mthca0
Memlock limit: 32768
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
I have checked that ulimit is set to unlimited, but I cannot seem to
figure this out. Any help?
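Note that a ulimit checked from an interactive shell does not necessarily match the limit inside the batch job, since the job inherits its limits from pbs_mom rather than from a login shell. A small diagnostic job can show the limit the job actually sees; a sketch reusing the same directives (job name and output file are placeholders):

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -N LimitCheck
#PBS -q long
#PBS -o limits.txt
# Print the locked-memory limit as seen inside the batch environment.
# If this shows 32768 (KB) rather than "unlimited", the pbs_mom limits
# need to be raised as described in the FAQ entries above.
ulimit -l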
Brian Andrus | Perot Systems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866
--
Jeff Squyres
Cisco Systems