If you are using the native Torque capabilities to launch Open MPI jobs, note that limits.conf is not necessarily obeyed. I'm not a Torque expert, but you should probably check out:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

And check the Torque docs about how it propagates and enforces such limits.


On May 17, 2008, at 10:58 AM, Javier Lazaro wrote:

I have installed torque-2.3.0 and openmpi-1.2.3.
I ran some tests and found that jobs launched with the '-hostfile' or '-machinefile' parameter are stopped for exceeding the limits in the file /etc/security/limits.conf
More details:

file hola.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"
int main(int argc, char *argv[]){
        int rank;
        int size;
        int i;
        int namelen;
        char pn[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD,&size);
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        MPI_Get_processor_name(pn,&namelen);

        sleep(rank);

        system("bash -c 'ulimit -a'");

        for (i=0;;i++) {
                if (i%100000000==0) {
                        printf("--> %i --> Hola desde %d, de un total de: %d. estoy en %s\n",i, rank, size,pn);
                }
        }
        MPI_Finalize();

        return 0;

}

##

> mpicc hola.c

file mpi3.sh

#!/bin/sh

#PBS -l nodes=3:ppn=1
#PBS -N pruebaMPI3
#PBS -o 3outpruebaMPIout3
#PBS -e 3errpruebaMPIerr3

cat ${PBS_NODEFILE}

mpirun -hostfile ${PBS_NODEFILE} /home/javier/mpi_hola/a.out

##

Launch the job with Torque:
> qsub mpi3.sh

##

After the job terminated:

file 3outpruebaMPIout3
maquina3b
maquina2b
maquina1b
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 8185
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited #limit maquina3b
max user processes              (-u) 8185
virtual memory          (kbytes, -v) 2511840
file locks                      (-x) unlimited
--> 0 --> Hola desde 0, de un total de: 3. estoy en maquina3b
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 8185
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) 880005
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) 60        #limit maquina2b
max user processes              (-u) 8185
virtual memory          (kbytes, -v) 2511840
file locks                      (-x) unlimited
--> 0 --> Hola desde 1, de un total de: 3. estoy en maquina2b
--> 100000000 --> Hola desde 0, de un total de: 3. estoy en maquina3b
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 8185
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) 880005
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) 60        #limit maquina1b
max user processes              (-u) 8185
virtual memory          (kbytes, -v) 2511840
file locks                      (-x) unlimited
--> 0 --> Hola desde 2, de un total de: 3. estoy en maquina1b
--> 100000000 --> Hola desde 1, de un total de: 3. estoy en maquina2b
--> 200000000 --> Hola desde 0, de un total de: 3. estoy en maquina3b
--> 100000000 --> Hola desde 2, de un total de: 3. estoy en maquina1b
--> 200000000 --> Hola desde 1, de un total de: 3. estoy en maquina2b
........
--> -500000000 --> Hola desde 1, de un total de: 3. estoy en maquina2b
1 additional process aborted (not shown)
1 process killed (possibly by Open MPI)

##

file 3errpruebaMPIerr3

mpirun noticed that job rank 0 with PID 10839 on node maquina3b exited on signal 15 (Terminated).

---------------------------
I have limited CPU time to 60 seconds on all nodes. Torque modifies this limit only on maquina3b.
I think that Torque should also modify the CPU limit on the rest of the nodes.
Where is the error?


--
Jeff Squyres
Cisco Systems
