I have had to cobble together two machines in our Rocks cluster without using the standard installation. They have an EFI-only BIOS and Rocks doesn't like that, so this was the only workaround.

Everything works great now, except for one thing. MPI jobs (openmpi or mpich) fail when started from one of these nodes (via qsub or by logging in and running the command) if 24 or more processes have to be launched on another node. However, if the originator of the MPI job is the headnode or any of the preexisting compute nodes, it works fine. Right now I am guessing at ssh client or ulimit problems, but I cannot find any difference between the new nodes and the old ones. Any help would be greatly appreciated.

compute-2-1 and compute-2-0 are the new nodes

Examples:

This works, prints 23 hostnames from each machine:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 46 hostname

This does not work; it prints 24 hostnames for compute-2-1:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 48 hostname

These both work, print 64 hostnames from each node:
[root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
[root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname

[root@compute-2-1 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16410016
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
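
For anyone who wants to check the limits side by side, here is a minimal sketch of how the `ulimit -a` output from a new node and an old node could be compared. The function name `compare_limits` and the file names are hypothetical; it assumes the output from each node has first been saved to a file, for example with `ssh compute-0-2 'ulimit -a' > limits.compute-0-2`.

```shell
#!/bin/bash
# Sketch only: compare two saved captures of `ulimit -a`.
# compare_limits and the file names are hypothetical, not part of Rocks or MPI.
compare_limits() {
    # $1 and $2 are files holding `ulimit -a` output from two nodes.
    # diff marks lines unique to the first file with "<" and to the
    # second with ">"; keep only those lines so the differences stand out.
    diff "$1" "$2" | grep '^[<>]'
}

# Example usage (assumes both files already exist):
#   compare_limits limits.compute-2-1 limits.compute-0-2
```

If the function prints nothing, the two captures are identical. Note that limits seen through a non-interactive ssh session may not match those of a login shell, so it may be worth capturing both.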

[root@compute-2-1 ~]# more /etc/ssh/ssh_config
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        Protocol                2,1
