I have had to cobble together two machines in our Rocks cluster without
using the standard installation. They have an EFI-only BIOS, which Rocks
doesn't support, so this was the only workaround.
Everything works great now, except for one thing. MPI jobs (Open MPI or
MPICH) fail when started from one of these nodes (via qsub, or by logging
in and running the command directly) if 24 or more processes are needed
on another node. However, if the originator of the MPI job is the head
node or any of the preexisting compute nodes, it works fine. Right now I
am guessing an ssh client or ulimit problem, but I cannot find any
difference between the nodes. Any help would be greatly appreciated.
compute-2-1 and compute-2-0 are the new nodes.
Examples:
This works; it prints 23 hostnames from each machine:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 46 hostname
This does not work; it only prints 24 hostnames, all from compute-2-1:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 48 hostname
These both work and print 64 hostnames from each node:
[root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
[root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
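One thing I still plan to try is turning up Open MPI's launcher verbosity
on the failing case, to see whether it is the ssh to compute-2-0 or the
remote orted that dies (a sketch; --debug-daemons and the plm_base_verbose
MCA parameter are standard Open MPI options, and 5 is just an arbitrary
verbosity level):

[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun --debug-daemons --mca plm_base_verbose 5 -host compute-2-0,compute-2-1 -np 48 hostname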
[root@compute-2-1 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16410016
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
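One caveat about the output above: it is from an interactive login, while
the daemons mpirun starts run under a non-interactive ssh shell, which can
pick up different limits (e.g. from /etc/security/limits.conf via
pam_limits). So it may be worth comparing what a non-interactive shell
gets on a new node versus a preexisting one (a sketch; compute-0-2 stands
in for any of the old nodes):

[root@compute-2-1 ~]# diff <(ssh compute-2-0 ulimit -a) <(ssh compute-0-2 ulimit -a)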
[root@compute-2-1 ~]# more /etc/ssh/ssh_config
Host *
CheckHostIP no
ForwardX11 yes
ForwardAgent yes
StrictHostKeyChecking no
UsePrivilegedPort no
Protocol 2,1
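As far as I can tell this matches the preexisting nodes. To rule out the
ssh client anyway, I can at least compare a verbose single connection from
a new node against the same thing from an old node (only a rough check,
since mpirun opens one ssh per remote node and the orted forks the ranks
from there):

[root@compute-2-1 ~]# ssh -v compute-2-0 hostname 2>&1 | tail -20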