Jeff,
       Thanks for the clear explanation.
                  Pat

J.W. (Pat) O'Bryant,Jr.
Business Line Infrastructure
Technical Systems, HPC





Jeff Squyres <jsquyres@cisco.com>
Sent by: users-bounces@open-mpi.org
11/09/07 06:18 AM
Please respond to Open MPI Users <users@open-mpi.org>

To: Open MPI Users <us...@open-mpi.org>
cc:
Subject: Re: [OMPI users] openib errors as user, but not root








On Nov 8, 2007, at 12:47 PM, Andrus, Mr. Brian (Contractor) wrote:

> Yep. I thought it was the startup script, but it was merely the fact I
> restarted it that got it going.
>
> I wonder if adding ulimit -l unlimited to the startup script will
> help, tho.

Yes, that is what the FAQ item is trying to say.

System-level daemons are not subject to PAM limits (PAM limits apply
mainly to interactive logins and the like).  Hence, they start with the
system default locked memory limit of 32k, and the user jobs that they
launch inherit that 32k limit.  A non-root process can *decrease* its
locked memory limit, but it cannot *increase* it above the hard limit
that is already set.  Hence, user jobs started by the PBS MOMs will be
stuck with the 32k limit.
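
The inheritance rule can be seen without root or PBS. The following is a minimal sketch, not something from the original thread: the function name child_memlock_after_lowering is made up for illustration, and it assumes bash is installed and that the current limit is at least as large as the value passed in.

```shell
#!/bin/bash
# Print the locked-memory limit (in KB) that a child process sees after
# its parent has lowered the limit to the value given in $1.
# The function name is hypothetical, chosen only for this illustration.
child_memlock_after_lowering() {
    (
        # Lowering a limit needs no privileges. Without -S or -H this
        # sets both the soft and the hard limit in this subshell only.
        ulimit -l "$1" || exit 1
        # The child shell inherits the lowered limit from the subshell.
        bash -c 'ulimit -l'
    )
}

# Mimic the 32k default discussed above.
child_memlock_after_lowering 32
```

A non-root child started this way cannot go back up to "unlimited" with its own ulimit call, which is the situation the jobs launched by the MOM are in.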

The solution is to put the ulimit in the startup script for PBS itself
(or whatever your resource manager is), before the daemon is started.
This re-sets the limit while you're still running as root.  Then the PBS
MOM will have an unlimited locked memory limit, and the processes that
it launches will inherit it.
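
As a rough sketch of what that could look like (not taken from the FAQ or from any particular distribution's init script): the start function and the PBS_MOM path below are assumptions for illustration, so adjust them to match your own resource manager's startup script.

```shell
#!/bin/sh
# Hypothetical excerpt from a resource manager init script.
# PBS_MOM is an assumed install path; change it to match your system.
PBS_MOM=${PBS_MOM:-/usr/local/sbin/pbs_mom}

start() {
    # Raise the locked-memory limit while still running as root, before
    # the daemon starts, so the MOM and every job it launches inherit it.
    ulimit -l unlimited || echo "warning: could not raise memlock limit" >&2
    "$PBS_MOM"
}
```

The limit only takes effect for daemons started after the ulimit call, so the MOM has to be restarted once the script is changed.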



>
>
> Brian Andrus perotsystems
> Site Manager | Sr. Computer Scientist
> Naval Research Lab
> 7 Grace Hopper Ave, Monterey, CA  93943
> Phone (831) 656-4839 | Fax (831) 656-4866
>
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of pat.o'bry...@exxonmobil.com
> Sent: Thursday, November 08, 2007 5:55 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] openib errors as user, but not root
>
> What we discovered is that our PBS mom daemon did not have unlimited
> locked memory. So, when your job is created by the mom daemon, it
> inherits the daemon's memory limits. The fix was to cycle the PBS mom
> daemon after a boot (and yes, we do start the mom daemon at boot, but
> for some reason it doesn't inherit unlimited locked memory). The way to
> determine if this is the problem is to place a "ulimit -a" in the text
> of your PBS job. Run your job and you will see a locked memory limit of
> 32K. Next, cycle the mom daemon on the node(s) of interest and re-run
> your job. You will now see unlimited locked memory.
>         Thanks,
>          Pat O'Bryant
>
>
> J.W. (Pat) O'Bryant,Jr.
> Business Line Infrastructure
> Technical Systems, HPC
>
>
>
>
>
>
> Jeff Squyres <jsquyres@cisco.com>
> Sent by: users-bounces@open-mpi.org
> 11/07/07 06:25 PM
> Please respond to Open MPI Users <users@open-mpi.org>
>
> To: Open MPI Users <us...@open-mpi.org>
> cc:
> Subject: Re: [OMPI users] openib errors as user, but not root
>
> Check out:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>
> In particular, see the stuff about using resource managers.
>
>
>
> On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:
>
>> Ok, I am having some difficulty troubleshooting this.
>>
>> If I run my hello program without torque, it works fine:
>> [root@login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
>> Hello from process 0 of 5 on node n01
>> Hello from process 1 of 5 on node n02
>> Hello from process 2 of 5 on node n03
>> Hello from process 3 of 5 on node n04
>> Hello from process 4 of 5 on node n05
>>
>> If I submit it as root, it seems happy:
>> [root@login1 root]# qsub
>> #!/bin/bash
>> #PBS -j oe
>> #PBS -l nodes=5:ppn=1
>> #PBS -W x=NACCESSPOLICY:SINGLEJOB
>> #PBS -N TestJob
>> #PBS -q long
>> #PBS -o output.txt
>> #PBS -V
>> cd $PBS_O_WORKDIR
>> rm -f output.txt
>> date
>> mpirun --mca btl openib,self /data/root/hello
>> 103.cluster.default.domain
>> [root@login1 root]# cat output.txt
>> Wed Nov  7 16:20:33 PST 2007
>> Hello from process 0 of 5 on node n05
>> Hello from process 1 of 5 on node n04
>> Hello from process 2 of 5 on node n03
>> Hello from process 3 of 5 on node n02
>> Hello from process 4 of 5 on node n01
>>
>> If I do it as me, not so good:
>> [andrus@login1 data]$ qsub
>> #!/bin/bash
>> #PBS -j oe
>> #PBS -l nodes=1:ppn=1
>> #PBS -W x=NACCESSPOLICY:SINGLEJOB
>> #PBS -N TestJob
>> #PBS -q long
>> #PBS -o output.txt
>> #PBS -V
>> cd $PBS_O_WORKDIR
>> rm -f output.txt
>> date
>> mpirun --mca btl openib,self /data/root/hello
>> 105.littlemac.default.domain
>> [andrus@login1 data]$ cat output.txt
>> Wed Nov  7 16:23:00 PST 2007
>>
>> --------------------------------------------------------------------------
>> The OpenIB BTL failed to initialize while trying to allocate some
>> locked memory.  This typically can indicate that the memlock limits
>> are set too low.  For most HPC installations, the memlock limits
>> should be set to "unlimited".  The failure occured here:
>>
>>    Host:          n01
>>    OMPI source:   btl_openib.c:828
>>    Function:      ibv_create_cq()
>>    Device:        mthca0
>>    Memlock limit: 32768
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.  This FAQ entry on the Open MPI web site may also be
>> helpful:
>>
>>    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process
>> is likely to abort.  There are many reasons that a parallel process
>> can fail during MPI_INIT; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>  PML add procs failed
>>  --> Returned "Error" (-1) instead of "Success" (0)
>>
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>>
>>
>>
>> I have checked that ulimit is unlimited. I cannot seem to figure this
>> out. Any help?
>> Brian Andrus perotsystems
>> Site Manager | Sr. Computer Scientist
>> Naval Research Lab
>> 7 Grace Hopper Ave, Monterey, CA  93943
>> Phone (831) 656-4839 | Fax (831) 656-4866
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


--
Jeff Squyres
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

