I have checked those out. I am trying to test limits. If I ssh directly to a node and check, everything is OK:

[andrus@login1 ~]$ ssh n01 ulimit -l
unlimited
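For comparison, the same check submitted through Torque shows the limit a job actually inherits (a sketch only: the queue, output file, and job id are examples, the cat is run after the job completes, and the 32768 shown is the value reported in the error below, not captured output):

[andrus@login1 ~]$ echo "ulimit -l" | qsub -q long -j oe -o limit.txt
106.cluster.default.domain
[andrus@login1 ~]$ cat limit.txt
32768

If that number is smaller than what the ssh check reports, jobs are inheriting pbs_mom's limits rather than the ones in /etc/security/limits.conf.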
The settings in /etc/security/limits.conf are right too.

Brian Andrus
perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Wednesday, November 07, 2007 4:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib errors as user, but not root

Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.

On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

> Ok, I am having some difficulty troubleshooting this.
>
> If I run my hello program without Torque, it works fine:
>
> [root@login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
> Hello from process 0 of 5 on node n01
> Hello from process 1 of 5 on node n02
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n04
> Hello from process 4 of 5 on node n05
>
> If I submit it as root, it seems happy:
>
> [root@login1 root]# qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=5:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 103.cluster.default.domain
>
> [root@login1 root]# cat output.txt
> Wed Nov 7 16:20:33 PST 2007
> Hello from process 0 of 5 on node n05
> Hello from process 1 of 5 on node n04
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n02
> Hello from process 4 of 5 on node n01
>
> If I do it as me, not so good:
>
> [andrus@login1 data]$ qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=1:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 105.littlemac.default.domain
>
> [andrus@login1 data]$ cat output.txt
> Wed Nov 7 16:23:00 PST 2007
> --------------------------------------------------------------------------
> The OpenIB BTL failed to initialize while trying to allocate some
> locked memory. This typically can indicate that the memlock limits
> are set too low. For most HPC installations, the memlock limits
> should be set to "unlimited". The failure occured here:
>
>   Host:          n01
>   OMPI source:   btl_openib.c:828
>   Function:      ibv_create_cq()
>   Device:        mthca0
>   Memlock limit: 32768
>
> You may need to consult with your system administrator to get this
> problem fixed. This FAQ entry on the Open MPI web site may also be
> helpful:
>
>   http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel
> process is likely to abort. There are many reasons that a parallel
> process can fail during MPI_INIT; some of which are due to
> configuration or environment problems.
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> I have checked that ulimit is unlimited. I cannot seem to figure this
> out. Any help?
>
> Brian Andrus
> perotsystems
> Site Manager | Sr. Computer Scientist
> Naval Research Lab
> 7 Grace Hopper Ave, Monterey, CA 93943
> Phone (831) 656-4839 | Fax (831) 656-4866

--
Jeff Squyres
Cisco Systems
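The resource-manager notes in those FAQ entries describe the likely culprit here: pbs_mom is started at boot, outside any login session, so it never picks up the PAM-applied settings from /etc/security/limits.conf, and every job it spawns inherits its small default memlock limit. That is why the ssh check looks fine while the Torque job fails. A sketch of the usual fix, assuming a stock init script at /etc/init.d/pbs_mom (paths and service names vary by installation):

# In /etc/init.d/pbs_mom, before the daemon is launched, raise the
# limit so that every job pbs_mom spawns inherits it:
ulimit -l unlimited

# Then restart the daemon on each compute node, e.g.:
service pbs_mom restart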