I have checked those out.

I am trying to test limits. If I ssh directly to a node and check,
everything is ok:
[andrus@login1 ~]$ ssh n01 ulimit -l
unlimited

The settings in /etc/security/limits.conf are right too. 
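
Since the failing job reported a memlock limit of 32768 while an
interactive ssh reports unlimited, it might help to print the limit from
inside the job itself, just before mpirun. A minimal sketch follows;
describe_memlock is a made-up helper name, not part of Torque or Open MPI,
and the message wording is illustrative only:

```shell
#!/bin/bash
# Sketch: report the locked-memory limit seen by the current process.
# describe_memlock is a hypothetical helper, not a Torque or Open MPI tool.
describe_memlock() {
    limit=$1
    if [ "$limit" = "unlimited" ]; then
        echo "memlock OK: unlimited"
    else
        # bash's "ulimit -l" reports the limit in units of 1024 bytes
        echo "memlock too low: ${limit} kbytes"
    fi
}

# Report the limit for whatever shell runs this line (e.g. the job script).
describe_memlock "$(ulimit -l)"
```

Placing the last line in the job script ahead of mpirun would show what
limit the batch job actually inherits, which may differ from what a login
shell sees.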


Brian Andrus perotsystems 
Site Manager | Sr. Computer Scientist 
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA  93943
Phone (831) 656-4839 | Fax (831) 656-4866 


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: Wednesday, November 07, 2007 4:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib errors as user, but not root

Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.



On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

> Ok, I am having some difficulty troubleshooting this.
>
> If I run my hello program without torque, it works fine:
> [root@login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
> Hello from process 0 of 5 on node n01
> Hello from process 1 of 5 on node n02
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n04
> Hello from process 4 of 5 on node n05
>
> If I submit it as root, it seems happy:
> [root@login1 root]# qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=5:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello 
> 103.cluster.default.domain
> [root@login1 root]# cat output.txt
> Wed Nov  7 16:20:33 PST 2007
> Hello from process 0 of 5 on node n05
> Hello from process 1 of 5 on node n04
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n02
> Hello from process 4 of 5 on node n01
>
> If I do it as me, not so good:
> [andrus@login1 data]$ qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=1:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello 
> 105.littlemac.default.domain
> [andrus@login1 data]$ cat output.txt
> Wed Nov  7 16:23:00 PST 2007
> --------------------------------------------------------------------------
> The OpenIB BTL failed to initialize while trying to allocate some
> locked memory.  This typically can indicate that the memlock limits 
> are set too low.  For most HPC installations, the memlock limits 
> should be set to "unlimited".  The failure occured here:
>
>     Host:          n01
>     OMPI source:   btl_openib.c:828
>     Function:      ibv_create_cq()
>     Device:        mthca0
>     Memlock limit: 32768
>
> You may need to consult with your system administrator to get this 
> problem fixed.  This FAQ entry on the Open MPI web site may also be
> helpful:
>
>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel 
> process is likely to abort.  There are many reasons that a parallel 
> process can fail during MPI_INIT; some of which are due to 
> configuration or environment problems.  This failure appears to be an 
> internal failure; here's some additional information (which may only 
> be relevant to an Open MPI developer):
>
>   PML add procs failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
>
>
> I have checked that ulimit is unlimited. I cannot seem to figure this out.
> Any help?
> Brian Andrus perotsystems
> Site Manager | Sr. Computer Scientist
> Naval Research Lab
> 7 Grace Hopper Ave, Monterey, CA  93943
> Phone (831) 656-4839 | Fax (831) 656-4866
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
