On Nov 8, 2007, at 12:47 PM, Andrus, Mr. Brian (Contractor) wrote:

Yep. I thought it was the startup script, but it was merely the fact that I
restarted it that got it going.

I wonder if adding "ulimit -l unlimited" to the startup script will help,
though.

Yes, that is what the FAQ item is trying to say.

System-level daemons are not subject to PAM limits (PAM limits apply mostly to interactive logins and the like), so they start with the system default of a 32 KB locked-memory limit. A non-root process can *decrease* its locked-memory limit, but it cannot *increase* it above the limit it was given. Hence, any user job started by a PBS MOM inherits, and is stuck with, the 32 KB limit.
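You can see this from a shell (just an illustration, assuming bash and a 32 KB hard limit already in place):

ulimit -H -l          # shows the hard limit, e.g. 32 (kbytes)
ulimit -l 16          # lowering the limit works for a non-root user
ulimit -l unlimited   # raising it above the hard limit is refused
                      # without root privileges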

The solution is to put the ulimit setting in the startup script for PBS itself (or whatever your resource manager is). That resets the limit while you're still running as root, so the PBS MOM gets an unlimited locked-memory limit and every process it launches inherits it.
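For example, something along these lines in the MOM's init script (just a sketch; the script name, location, and daemon path all depend on your Torque/PBS installation):

# e.g. /etc/init.d/pbs_mom  (path is an assumption -- adjust for your site)
# Raise the locked-memory limit before the daemon starts, so every job
# the MOM launches inherits the unlimited limit.
ulimit -l unlimited

case "$1" in
  start)
    /usr/sbin/pbs_mom     # daemon path varies by installation
    ;;
esac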





Brian Andrus perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA  93943
Phone (831) 656-4839 | Fax (831) 656-4866


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of pat.o'bry...@exxonmobil.com
Sent: Thursday, November 08, 2007 5:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] openib errors as user, but not root

What we discovered is that our PBS mom daemon did not have unlimited
locked memory, so when your job is created by the mom daemon it
inherits those memory limits. The fix was to cycle the PBS mom daemon
after a boot (and yes, we do start the mom daemon at boot, but for some
reason it doesn't inherit unlimited locked memory). The way to determine
whether this is the problem is to place a "ulimit -a" in the text of your
PBS job. Run the job and you will see a locked-memory limit of 32 KB.
Then cycle the mom daemon on the node(s) of interest and re-run the job:
you will now see unlimited locked memory.
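
For example, a minimal diagnostic job along these lines (the queue, node
count, and output file name are placeholders; adapt them to your site):

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -N limitcheck
#PBS -o limits.txt
# "max locked memory" will read 32 (kbytes) before the mom daemon is
# cycled, and "unlimited" afterwards.
ulimit -a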
        Thanks,
         Pat O'Bryant


J.W. (Pat) O'Bryant,Jr.
Business Line Infrastructure
Technical Systems, HPC






From: Jeff Squyres <jsquyres@cisco.com>
Sent by: users-bounces@open-mpi.org
Date: 11/07/07 06:25 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] openib errors as user, but not root
Please respond to: Open MPI Users <users@open-mpi.org>

Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.



On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

Ok, I am having some difficulty troubleshooting this.

If I run my hello program without Torque, it works fine:
[root@login1 root]# mpirun --mca btl openib,self -host
n01,n02,n03,n04,n05 /data/root/hello
Hello from process 0 of 5 on node n01
Hello from process 1 of 5 on node n02
Hello from process 2 of 5 on node n03
Hello from process 3 of 5 on node n04
Hello from process 4 of 5 on node n05

If I submit it as root, it seems happy:
[root@login1 root]# qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=5:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N TestJob
#PBS -q long
#PBS -o output.txt
#PBS -V
cd $PBS_O_WORKDIR
rm -f output.txt
date
mpirun --mca btl openib,self /data/root/hello
103.cluster.default.domain
[root@login1 root]# cat output.txt
Wed Nov  7 16:20:33 PST 2007
Hello from process 0 of 5 on node n05
Hello from process 1 of 5 on node n04
Hello from process 2 of 5 on node n03
Hello from process 3 of 5 on node n02
Hello from process 4 of 5 on node n01

If I do it as me, not so good:
[andrus@login1 data]$ qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N TestJob
#PBS -q long
#PBS -o output.txt
#PBS -V
cd $PBS_O_WORKDIR
rm -f output.txt
date
mpirun --mca btl openib,self /data/root/hello
105.littlemac.default.domain
[andrus@login1 data]$ cat output.txt
Wed Nov  7 16:23:00 PST 2007

--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to allocate some
locked memory.  This typically can indicate that the memlock limits
are set too low.  For most HPC installations, the memlock limits
should be set to "unlimited".  The failure occured here:

   Host:          n01
   OMPI source:   btl_openib.c:828
   Function:      ibv_create_cq()
   Device:        mthca0
   Memlock limit: 32768

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

   http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

--------------------------------------------------------------------------

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process
is likely to abort.  There are many reasons that a parallel process
can fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)



I have checked that the locked-memory ulimit is unlimited, but I cannot seem to figure this out.

Any help?
Brian Andrus perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA  93943
Phone (831) 656-4839 | Fax (831) 656-4866
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems



--
Jeff Squyres
Cisco Systems
