It wouldn't be ssh - in both cases, only one ssh is done to each node (to start the local daemon). The only difference is the number of fork/execs done on each node, and the number of file descriptors opened to support those fork/execs.
It certainly looks like your limits are high enough. When you say it "fails", what do you mean - what error does it report? Try adding: --leave-session-attached -mca odls_base_verbose 5 to your cmd line - this will report all the local proc launch debug output and hopefully show you a more detailed error report (an example invocation is sketched after the quoted message below).

On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> I have had to cobble together two machines in our Rocks cluster without using
> the standard installation; they have EFI-only BIOS on them and Rocks doesn't
> like that, so it is the only workaround.
>
> Everything works great now, except for one thing. MPI jobs (openmpi or
> mpich) fail when started from one of these nodes (via qsub or by logging in
> and running the command) if 24 or more processors are needed on another
> system. However, if the originator of the MPI job is the headnode or any of
> the preexisting compute nodes, it works fine. Right now I am guessing ssh
> client or ulimit problems, but I cannot find any difference. Any help would
> be greatly appreciated.
>
> compute-2-1 and compute-2-0 are the new nodes.
>
> Examples:
>
> This works, prints 23 hostnames from each machine:
> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 46 hostname
>
> This does not work, prints 24 hostnames for compute-2-1:
> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 48 hostname
>
> These both work, print 64 hostnames from each node:
> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
>
> [root@compute-2-1 ~]# ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16410016
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 4096
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 1024
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
> Host *
>         CheckHostIP             no
>         ForwardX11              yes
>         ForwardAgent            yes
>         StrictHostKeyChecking   no
>         UsePrivilegedPort       no
>         Protocol                2,1
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
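
For reference, a minimal sketch of what that debug invocation might look like, reusing the openmpi-1.6.3 install path, host list, and the failing -np 48 case from the quoted message (adjust to match your actual job):

[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun --leave-session-attached -mca odls_base_verbose 5 -host compute-2-0,compute-2-1 -np 48 hostname

With the session left attached, the per-node launch debug output should stay on your terminal instead of being discarded, so any error from the local daemon on compute-2-0 when it tries to start the 24th process should be visible.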