Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Ralph Castain
Hooray!! Great to hear - I was running out of ideas :-) On Dec 19, 2012, at 2:01 PM, Daniel Davidson wrote: > I figured this out. > > ssh was working, but scp was not due to an MTU mismatch between the systems. > Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the > problem

Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Daniel Davidson
I figured this out. ssh was working, but scp was not due to an MTU mismatch between the systems. Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem. Dan On 12/17/2012 04:12 PM, Daniel Davidson wrote: Yes, it does. Dan [root@compute-2-1 ~]# ssh compute-2-0 W
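
For context, a minimal sketch of the fix described above, assuming a Rocks/CentOS-style interface file (the interface name eth2 and the 1500-byte value come from the message; the other fields are illustrative placeholders):

  # /etc/sysconfig/network-scripts/ifcfg-eth2 (illustrative fields)
  DEVICE=eth2
  ONBOOT=yes
  MTU=1500    # force a 1500-byte MTU to match the peer node

  # verify the effective MTU and restart networking to apply the change
  ip link show eth2
  service network restart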

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
Yes, it does. Dan [root@compute-2-1 ~]# ssh compute-2-0 Warning: untrusted X11 forwarding setup failed: xauth key data not generated Warning: No xauth data; using fake authentication data for X11 forwarding. Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local [root@compute-2-0 ~]# ssh co

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Doug Reeder
Daniel, Does passwordless ssh work? You need to make sure that it does. Doug On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote: > I would also add that scp seems to be creating the file in the /tmp directory > of compute-2-0, and that /var/log/secure is showing ssh connections being > accepted.
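
A quick way to confirm passwordless ssh in both directions, using the hostnames already in this thread (key type and paths are the usual defaults, not confirmed from the thread):

  # each command should print the remote hostname without a password prompt
  ssh compute-2-0 hostname
  ssh compute-2-1 hostname

  # if a password is requested, generate and install a key
  ssh-keygen -t rsa
  ssh-copy-id root@compute-2-0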

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
I would also add that scp seems to be creating the file in the /tmp directory of compute-2-0, and that /var/log/secure is showing ssh connections being accepted. Is there anything in ssh that can limit connections that I need to look out for? My guess is that it is part of the client prefs an
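
On the question of ssh limiting connections: sshd itself can throttle concurrent unauthenticated connections via MaxStartups (and, in OpenSSH 5.1+, MaxSessions). A hedged sketch of what to check, with typical default values rather than anything confirmed from this cluster:

  # /etc/ssh/sshd_config
  MaxStartups 10    # max concurrent unauthenticated connections
  MaxSessions 10    # max sessions per network connection

  # apply any change
  service sshd reload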

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
After a very long time (15 minutes or so), I finally received the following in addition to what I just sent earlier: [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:246

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Ralph Castain
Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way. Looking at the ssh line, we are going to attempt to send a message from node 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to wo
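
A few hedged checks for the firewall theory, using the 10.1.255.226 address from the message (ORTE daemons use dynamic TCP ports, so there is no single port to probe):

  # from compute-2-0: basic reachability of node 2-1
  ping -c 3 10.1.255.226

  # look for DROP/REJECT rules on either node
  iptables -L -n

  # as a temporary test only, disable the firewall on both nodes
  service iptables stop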

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
These nodes have not yet been locked down to prevent jobs from being launched from the backend, at least not on purpose. The added logging returns the information below: [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Ralph Castain
?? That was all the output? If so, then something is indeed quite wrong, as it didn't even attempt to launch the job. Try adding -mca plm_base_verbose 5 to the cmd line. I was assuming you were using ssh as the launcher, but I wonder if you are in some managed environment? If so, then it could b
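
The suggested flag slots into the command line already in use elsewhere in this thread (paths, hostnames, and the other flags are taken from those runs):

  /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 \
      -np 10 --leave-session-attached \
      -mca plm_base_verbose 5 -mca odls_base_verbose 5 hostname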

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
This looks to be having issues as well, and I cannot get any number of processors to give me a different result with the new version. [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca odls_base_verbose 5 hostname [

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
I will give this a try, but wouldn't that be an issue as well if the process was run on the head node or another node? So long as the MPI job is not started on either of these two nodes, it works fine. Dan On 12/14/2012 11:46 PM, Ralph Castain wrote: It must be making contact or ORTE wouldn'

Re: [OMPI users] mpi problems/many cpus per node

2012-12-15 Thread Ralph Castain
It must be making contact or ORTE wouldn't be attempting to launch your application's procs. Looks more like it never received the launch command. Looking at the code, I suspect you're getting caught in a race condition that causes the message to get "stuck". Just to see if that's the case, you

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Daniel Davidson
Thank you for the help so far. Here is the information that the debugging gives me. Looks like the daemon on the non-local node never makes contact. If I step NP back by two, though, it does. Dan [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Ralph Castain
Sorry - I forgot that you built from a tarball, and so debug isn't enabled by default. You need to configure --enable-debug. On Dec 14, 2012, at 1:52 PM, Daniel Davidson wrote: > Oddly enough, adding this debugging info, lowered the number of processes > that can be used down to 42 from 46. W
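
For a tarball build, the debug-enabled rebuild would look roughly like this (the prefix matches the install path used elsewhere in the thread; any other configure options should match the original build):

  ./configure --prefix=/home/apps/openmpi-1.6.3 --enable-debug
  make all
  make install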

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Daniel Davidson
Oddly enough, adding this debugging info lowered the number of processes that can be used from 46 down to 42. When I run the MPI job, it fails, giving only the information that follows: [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -np 44 --leave-se

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Ralph Castain
It wouldn't be ssh - in both cases, only one ssh is being done to each node (to start the local daemon). The only difference is the number of fork/exec's being done on each node, and the number of file descriptors being opened to support those fork/exec's. It certainly looks like your limits ar
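
A hedged sketch of inspecting and raising the limits in question on each node (the values shown are illustrative; the thread does not state the actual limits):

  # current per-process limits
  ulimit -n    # open file descriptors
  ulimit -u    # max user processes

  # raise them persistently in /etc/security/limits.conf, e.g.:
  #   root  soft  nofile  8192
  #   root  hard  nofile  8192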

[OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Daniel Davidson
I have had to cobble together two machines in our Rocks cluster without using the standard installation; they have EFI-only BIOS on them and Rocks doesn't like that, so it is the only workaround. Everything works great now, except for one thing. MPI jobs (Open MPI or MPICH) fail when started fr