It wouldn't be ssh - in both cases, only one ssh connection is made to each node 
(to start the local daemon). The only difference is the number of fork/execs 
performed on each node, and the number of file descriptors opened to support 
those fork/execs.
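
If you want to see how many descriptors the daemon is actually holding while a 
launch is in progress, a quick Linux-only check like the following should work 
(the daemon is named orted in Open MPI 1.6; the pgrep lookup is just an 
illustration, adjust as needed):

# run on the remote node while the job is launching
for pid in $(pgrep orted); do
    echo "orted $pid: $(ls /proc/$pid/fd | wc -l) open fds"
done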

It certainly looks like your limits are high enough. When you say it "fails", 
what do you mean - what error does it report? Try adding:

--leave-session-attached -mca odls_base_verbose 5

to your command line - this will report all of the local process-launch debug 
output and hopefully give you a more detailed error report.
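
For example, taking the failing command from your mail (same paths, just with 
the debug options added):

/home/apps/openmpi-1.6.3/bin/mpirun --leave-session-attached \
    -mca odls_base_verbose 5 \
    -host compute-2-0,compute-2-1 -np 48 hostname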


On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> I have had to cobble together two machines in our Rocks cluster without using 
> the standard installation; they have an EFI-only BIOS, which Rocks doesn't 
> like, so this was the only workaround.
> 
> Everything works great now, except for one thing.  MPI jobs (Open MPI or 
> MPICH) fail when started from one of these nodes (via qsub or by logging in 
> and running the command) if 24 or more processes are needed on another 
> system.  However, if the originator of the MPI job is the head node or any of 
> the preexisting compute nodes, it works fine.  Right now I am guessing ssh 
> client or ulimit problems, but I cannot find any difference.  Any help would 
> be greatly appreciated.
> 
> compute-2-1 and compute-2-0 are the new nodes
> 
> Examples:
> 
> This works, prints 23 hostnames from each machine:
> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
> compute-2-0,compute-2-1 -np 46 hostname
> 
> This does not work, prints only 24 hostnames, all from compute-2-1:
> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
> compute-2-0,compute-2-1 -np 48 hostname
> 
> These both work, print 64 hostnames from each node:
> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
> compute-2-0,compute-2-1 -np 128 hostname
> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
> compute-2-0,compute-2-1 -np 128 hostname
> 
> [root@compute-2-1 ~]# ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16410016
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 4096
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 1024
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 
> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
> Host *
>        CheckHostIP             no
>        ForwardX11              yes
>        ForwardAgent            yes
>        StrictHostKeyChecking   no
>        UsePrivilegedPort       no
>        Protocol                2,1
> 