Ok, I am having some difficulty troubleshooting this. If I run my hello program without Torque, it works fine:

[root@login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
Hello from process 0 of 5 on node n01
Hello from process 1 of 5 on node n02
Hello from process 2 of 5 on node n03
Hello from process 3 of 5 on node n04
Hello from process 4 of 5 on node n05
If I submit it as root, it seems happy:

[root@login1 root]# qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=5:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N TestJob
#PBS -q long
#PBS -o output.txt
#PBS -V
cd $PBS_O_WORKDIR
rm -f output.txt
date
mpirun --mca btl openib,self /data/root/hello
103.cluster.default.domain
[root@login1 root]# cat output.txt
Wed Nov 7 16:20:33 PST 2007
Hello from process 0 of 5 on node n05
Hello from process 1 of 5 on node n04
Hello from process 2 of 5 on node n03
Hello from process 3 of 5 on node n02
Hello from process 4 of 5 on node n01

If I do it as me, not so good:

[andrus@login1 data]$ qsub
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N TestJob
#PBS -q long
#PBS -o output.txt
#PBS -V
cd $PBS_O_WORKDIR
rm -f output.txt
date
mpirun --mca btl openib,self /data/root/hello
105.littlemac.default.domain
[andrus@login1 data]$ cat output.txt
Wed Nov 7 16:23:00 PST 2007
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to allocate some
locked memory. This typically can indicate that the memlock limits
are set too low. For most HPC installations, the memlock limits
should be set to "unlimited". The failure occured here:

    Host:          n01
    OMPI source:   btl_openib.c:828
    Function:      ibv_create_cq()
    Device:        mthca0
    Memlock limit: 32768

You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.
This failure appears to be an internal failure; here's some additional
information (which may only be relevant to an Open MPI developer):

    PML add procs failed
    --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

I have checked that ulimit reports unlimited, but I cannot seem to figure this out. Any help?

Brian Andrus
Perot Systems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866
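P.S. In case it is relevant, here is roughly how I have been checking the limit. Note the caveat (per the FAQ page above) that an interactive ulimit may not match what processes spawned by the pbs_mom daemon inherit, so the second command is just a sketch of checking from inside a job; the queue name and resource line are from my script above:

```shell
# Limit as seen by my interactive shell:
ulimit -l

# Limit as seen inside a Torque job may differ, because pbs_mom (a
# daemon, not my login shell) sets the limits for job processes.
# Sketch: submit a one-line job that prints the inherited limit.
if command -v qsub >/dev/null; then
    echo 'ulimit -l' | qsub -l nodes=1:ppn=1 -q long -j oe -o limit.txt
fi
```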