I have installed torque-2.3.0 and openmpi-1.2.3. While testing, I found that jobs launched with the '-hostfile' or '-machinefile' parameter are stopped for exceeding the limits set in the file /etc/security/limits.conf. More details:

File hola.c:

#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <unistd.h>
#include "mpi.h"

int main(int argc, char *argv[]){
    int rank;
    int size;
    int i;
    int namelen;
    char pn[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Get_processor_name(pn,&namelen);

    sleep(rank);                     /* stagger the output of the ranks */
    system("bash -c 'ulimit -a'");   /* print the limits this process runs under */

    /* Busy loop that never exits, so the process keeps consuming CPU time;
       MPI_Finalize() below is never reached. The message reads:
       "Hello from <rank>, out of a total of <size>. I am on <host>". */
    for (i=0;;i++) {
        if (i%100000000==0) {
            printf("--> %i --> Hola desde %d, de un total de: %d. estoy en %s\n",i, rank, size,pn);
        }
    }
    MPI_Finalize();
    return 0;
}

## compile
> mpicc hola.c

File mpi3.sh:

#!/bin/sh
#PBS -l nodes=3:ppn=1
#PBS -N pruebaMPI3
#PBS -o 3outpruebaMPIout3
#PBS -e 3errpruebaMPIerr3
cat ${PBS_NODEFILE}
mpirun -hostfile ${PBS_NODEFILE} /home/javier/mpi_hola/a.out

## launch job with torque
> qsub mpi3.sh

## finished

File 3outpruebaMPIout3:

maquina3b
maquina2b
maquina1b
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 8185
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
*cpu time               (seconds, -t) unlimited #limit maquina3b*
max user processes              (-u) 8185
virtual memory          (kbytes, -v) 2511840
file locks                      (-x) unlimited
--> 0 --> Hola desde 0, de un total de: 3. estoy en maquina3b
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 8185
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) 880005
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
*cpu time               (seconds, -t) 60 #limit maquina2b*
max user processes              (-u) 8185
virtual memory          (kbytes, -v) 2511840
file locks                      (-x) unlimited
--> 0 --> Hola desde 1, de un total de: 3. estoy en maquina2b
--> 100000000 --> Hola desde 0, de un total de: 3. estoy en maquina3b
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 8185
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) 880005
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
*cpu time               (seconds, -t) 60 #limit maquina1b*
max user processes              (-u) 8185
virtual memory          (kbytes, -v) 2511840
file locks                      (-x) unlimited
--> 0 --> Hola desde 2, de un total de: 3. estoy en maquina1b
--> 100000000 --> Hola desde 1, de un total de: 3. estoy en maquina2b
--> 200000000 --> Hola desde 0, de un total de: 3. estoy en maquina3b
--> 100000000 --> Hola desde 2, de un total de: 3. estoy en maquina1b
--> 200000000 --> Hola desde 1, de un total de: 3. estoy en maquina2b
........
--> -500000000 --> Hola desde 1, de un total de: 3. estoy en maquina2b
*1 additional process aborted (not shown)
1 process killed (possibly by Open MPI)*

File 3errpruebaMPIerr3:

mpirun noticed that job rank 0 with PID 10839 on node maquina3b exited on signal 15 (Terminated).

---------------------------

I have limited CPU time to 60 seconds on all nodes. Torque changes this limit only on maquina3b (there it becomes unlimited); on maquina2b and maquina1b the processes still run with the 60-second limit. I think Torque should also change the CPU limit on the rest of the nodes. Where is the error?
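
A narrower check than dumping 'ulimit -a' through a shell would be for each rank to query its own CPU-time limit with getrlimit(). The following is only a minimal sketch, assuming a POSIX system; format_limit and report_cpu_limit are hypothetical helper names introduced here for illustration and are not part of Open MPI or Torque.

```c
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: render a limit the way `ulimit` does,
 * as a decimal number or the word "unlimited". */
void format_limit(rlim_t value, char *buf, size_t len)
{
    if (value == RLIM_INFINITY)
        snprintf(buf, len, "unlimited");
    else
        snprintf(buf, len, "%llu", (unsigned long long)value);
}

/* Hypothetical helper: print the soft and hard CPU-time limits of the
 * calling process, tagged with the host name and the given rank.
 * Returns 0 on success, -1 if getrlimit() fails. */
int report_cpu_limit(int rank)
{
    struct rlimit rl;
    char host[256] = "unknown";   /* kept if gethostname() fails */
    char soft[32];
    char hard[32];

    if (getrlimit(RLIMIT_CPU, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    gethostname(host, sizeof(host) - 1);

    format_limit(rl.rlim_cur, soft, sizeof(soft));
    format_limit(rl.rlim_max, hard, sizeof(hard));
    printf("rank %d on %s: cpu time (seconds) soft=%s hard=%s\n",
           rank, host, soft, hard);
    return 0;
}
```

In hola.c, calling report_cpu_limit(rank) after MPI_Comm_rank() would print one line per rank, so the limit each process actually runs under can be compared across maquina1b, maquina2b and maquina3b without the rest of the 'ulimit -a' output.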