Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Ralph Castain
Hooray!! Great to hear - I was running out of ideas :-) On Dec 19, 2012, at 2:01 PM, Daniel Davidson wrote: > I figured this out. > > ssh was working, but scp was not due to an mtu mismatch between the systems. > Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the > pro

Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Daniel Davidson
I figured this out. ssh was working, but scp was not due to an mtu mismatch between the systems. Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem. Dan On 12/17/2012 04:12 PM, Daniel Davidson wrote: Yes, it does. Dan [root@compute-2-1 ~]# ssh compute-2-0 W
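
For reference, a minimal sketch of the interface configuration change described above, assuming a Red Hat/Rocks-style network-scripts layout; everything other than the MTU line is an illustrative placeholder:

    # /etc/sysconfig/network-scripts/ifcfg-eth2 (illustrative fragment)
    DEVICE=eth2
    ONBOOT=yes
    MTU=1500    # force a 1500-byte MTU so both ends of the link agree
    # apply the change by restarting the interface:
    #   ifdown eth2 && ifup eth2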

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
Yes, it does. Dan [root@compute-2-1 ~]# ssh compute-2-0 Warning: untrusted X11 forwarding setup failed: xauth key data not generated Warning: No xauth data; using fake authentication data for X11 forwarding. Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local [root@compute-2-0 ~]# ssh co

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Doug Reeder
Daniel, Does passwordless ssh work? You need to make sure that it does. Doug On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote: > I would also add that scp seems to be creating the file in the /tmp directory > of compute-2-0, and that /var/log secure is showing ssh connections being > accepted.
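
A quick way to verify passwordless ssh in both directions (a generic sketch, not from the thread itself; the host names follow the compute-2-* naming used above):

    # from compute-2-1, this should print the remote hostname without any prompt:
    ssh -o BatchMode=yes compute-2-0 hostname
    # and the reverse direction, from compute-2-0:
    ssh -o BatchMode=yes compute-2-1 hostname
    # if either command prompts or fails, (re)install the key, e.g.:
    #   ssh-keygen -t rsa && ssh-copy-id compute-2-0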

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
I would also add that scp seems to be creating the file in the /tmp directory of compute-2-0, and that /var/log/secure is showing ssh connections being accepted. Is there anything in ssh that can limit connections that I need to look out for? My guess is that it is part of the client prefs an

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
After a very long time (15 minutes or so), I finally received the following in addition to what I just sent earlier: [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:246

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Ralph Castain
Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way. Looking at the ssh line, we are going to attempt to send a message from node 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to wo

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
These nodes have not yet been locked down to prevent jobs from being launched from the backend, at least not on purpose. The added logging returns the information below: [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Ralph Castain
?? That was all the output? If so, then something is indeed quite wrong as it didn't even attempt to launch the job. Try adding -mca plm_base_verbose 5 to the cmd line. I was assuming you were using ssh as the launcher, but I wonder if you are in some managed environment? If so, then it could b
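
For readers following along, the suggested flag would be added to the command line already in use in this thread roughly as follows (a sketch; the paths, hosts, and odls verbosity flag are taken from the commands quoted elsewhere in the thread):

    /home/apps/openmpi-1.7rc5/bin/mpirun \
        -host compute-2-0,compute-2-1 -np 10 \
        --leave-session-attached \
        -mca plm_base_verbose 5 -mca odls_base_verbose 5 \
        hostname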

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
This looks to be having issues as well, and I cannot get any number of processors to give me a different result with the new version. [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca odls_base_verbose 5 hostname [

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
I will give this a try, but wouldn't that be an issue as well if the process was run on the head node or another node? So long as the mpi job is not started on either of these two nodes, it works fine. Dan On 12/14/2012 11:46 PM, Ralph Castain wrote: It must be making contact or ORTE wouldn'

Re: [OMPI users] mpi problems/many cpus per node

2012-12-15 Thread Ralph Castain
It must be making contact or ORTE wouldn't be attempting to launch your application's procs. Looks more like it never received the launch command. Looking at the code, I suspect you're getting caught in a race condition that causes the message to get "stuck". Just to see if that's the case, you

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Daniel Davidson
Thank you for the help so far. Here is the information that the debugging gives me. Looks like the daemon on the non-local node never makes contact. If I step NP back by two, though, it does. Dan [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Ralph Castain
Sorry - I forgot that you built from a tarball, and so debug isn't enabled by default. You need to configure --enable-debug. On Dec 14, 2012, at 1:52 PM, Daniel Davidson wrote: > Oddly enough, adding this debugging info, lowered the number of processes > that can be used down to 42 from 46. W
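
A minimal sketch of rebuilding the tarball with debugging enabled, as suggested above (the prefix is just an example matching the install paths used in this thread):

    cd openmpi-1.6.3
    ./configure --prefix=/home/apps/openmpi-1.6.3 --enable-debug
    make -j4 && make install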

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Daniel Davidson
Oddly enough, adding this debugging info lowered the number of processes that can be used down to 42 from 46. When I run the MPI job, it fails, giving only the information that follows: [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -np 44 --leave-se

Re: [OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Ralph Castain
It wouldn't be ssh - in both cases, only one ssh is being done to each node (to start the local daemon). The only difference is the number of fork/exec's being done on each node, and the number of file descriptors being opened to support those fork/exec's. It certainly looks like your limits ar

[OMPI users] mpi problems/many cpus per node

2012-12-14 Thread Daniel Davidson
I have had to cobble together two machines in our Rocks cluster without using the standard installation; they have EFI-only BIOS on them and Rocks doesn't like that, so this is the only workaround. Everything works great now, except for one thing. MPI jobs (openmpi or mpich) fail when started fr

Re: [OMPI users] mpi problems,

2011-04-07 Thread Nehemiah Dacres
Oh, thank you! That might work. On Thu, Apr 7, 2011 at 5:31 AM, Terry Dontje wrote: > Nehemiah, > I took a look at an old version of a hpl Makefile I have. I think what you > really want to do is not set the MP* variables to anything and near the end > of the Makefile set CC and LINKER to mpicc.

Re: [OMPI users] mpi problems,

2011-04-07 Thread Terry Dontje
Nehemiah, I took a look at an old version of a hpl Makefile I have. I think what you really want to do is not set the MP* variables to anything and near the end of the Makefile set CC and LINKER to mpicc. You may need to also change the CFLAGS and LINKERFLAGS variables to match which compile
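
A sketch of the kind of Make.<arch> edit being described, assuming mpicc supplies the MPI include and link flags itself; the variable names follow netlib HPL's Make.<arch> convention and the values are illustrative, not a drop-in file:

    # in Make.<arch>: leave the MP* variables empty and let the wrapper do the work
    MPdir  =
    MPinc  =
    MPlib  =
    CC     = mpicc
    LINKER = mpicc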

Re: [OMPI users] mpi problems,

2011-04-07 Thread Terry Dontje
On 04/06/2011 03:38 PM, Nehemiah Dacres wrote: I am also trying to get netlib's hpl to run via sun cluster tools so i am trying to compile it and am having trouble. Which is the proper mpi library to give? naturally this isn't going to work MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/ MPinc

Re: [OMPI users] mpi problems,

2011-04-06 Thread Ralph Castain
Sigh...look at the output of mpicc --showme. It tells you where the OMPI libs were installed: -I/opt/SUNWhpc/HPC8.2.1c/sun/include/64 -I/opt/SUNWhpc/HPC8.2.1c/sun/include/64/openmpi -R/opt/mx/lib/lib64 -R/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64 -L/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64 -lmpi -lopen-rt
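
The flags above come from the wrapper itself; a generic way to inspect (and reuse) them with Open MPI's wrapper compilers, using the Sun CT path quoted above:

    # full compile/link line the wrapper would use:
    /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpicc --showme
    # just the compile or just the link portion:
    /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpicc --showme:compile
    /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpicc --showme:link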

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
[jian@therock lib]$ ls lib64/*.a lib64/libotf.a lib64/libvt.fmpi.a lib64/libvt.omp.a lib64/libvt.a lib64/libvt.mpi.a lib64/libvt.ompi.a Last time I linked one of those files it told me they were in the wrong format. These are in archive format; what format should they be in? On Wed, Apr 6,

Re: [OMPI users] mpi problems,

2011-04-06 Thread Ralph Castain
Look at your output from mpicc --showme. It indicates that the OMPI libs were put in the lib64 directory, not lib. On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote: > I am also trying to get netlib's hpl to run via sun cluster tools so i am > trying to compile it and am having trouble. Which

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
I am also trying to get netlib's hpl to run via sun cluster tools, so I am trying to compile it and am having trouble. Which is the proper mpi library to give? Naturally this isn't going to work: MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/ MPinc= -I$(MPdir)/include *MPlib= $(MPdir)/li

Re: [OMPI users] mpi problems,

2011-04-06 Thread Terry Dontje
Something looks fishy about your numbers. The first two sets of numbers look the same, and the last set does look better for the most part. Your mpirun command line looks weird to me with the "-mca orte_base_help_aggregate btl,openib,self,"; did something get chopped off with the text copy? You

Re: [OMPI users] mpi problems,

2011-04-06 Thread Eugene Loh
Nehemiah Dacres wrote: also, I'm not sure if I'm reading the results right. According to the last run, did using the sun compilers (update 1 )  result in higher performance with sunct? On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres wrote: this first test was run as

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
Also, I'm not sure if I'm reading the results right. According to the last run, did using the sun compilers (update 1) result in higher performance with sunct? On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres wrote: > some tests I did. I hope this isn't an abuse of the list. please tell me if

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
Some tests I did. I hope this isn't an abuse of the list; please tell me if it is, but thanks to all those who helped me. This goes to show that the Sun MPI works with programs not compiled with Sun's compilers. This first test was run as a base case to see if MPI works; the second run is to see

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
On Mon, Apr 4, 2011 at 7:35 PM, Terry Dontje wrote: > libfui.so is a library a part of the Solaris Studio FORTRAN tools. It > should be located under lib from where your Solaris Studio compilers are > installed from. So one question is whether you actually have Studio Fortran > installed on all

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
Thanks, all. I realized that the Sun compilers weren't installed on all the nodes. It seems to be working; soon I will test the mca parameters for IB. On Mon, Apr 4, 2011 at 7:35 PM, Terry Dontje wrote: > libfui.so is a library a part of the Solaris Studio FORTRAN tools. It > should be located

Re: [OMPI users] mpi problems,

2011-04-04 Thread Terry Dontje
libfui.so is a library that is part of the Solaris Studio Fortran tools. It should be located under lib in the directory where your Solaris Studio compilers are installed. So one question is whether you actually have Studio Fortran installed on all your nodes or not. --td On 04/04/2011 04:02 PM, Ralph C

Re: [OMPI users] mpi problems,

2011-04-04 Thread Samuel K. Gutierrez
What does 'ldd ring2' show? How was it compiled? -- Samuel K. Gutierrez Los Alamos National Laboratory On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote: [jian@therock ~]$ echo $LD_LIBRARY_PATH /opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26- amd64:/opt/gridengine/lib/lx26-a
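
For anyone reproducing the check being requested here, a short sketch (the binary name is the one used in the thread):

    # list the shared libraries ring2 needs and where they resolve:
    ldd ./ring2
    # or show only the ones that cannot be found:
    ldd ./ring2 | grep -i 'not found'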

Re: [OMPI users] mpi problems,

2011-04-04 Thread Ralph Castain
Well, where is libfui located? Is that location in your ld path? Is the lib present on all nodes in your hostfile? On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote: > [jian@therock ~]$ echo $LD_LIBRARY_PATH > /opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengin

Re: [OMPI users] mpi problems,

2011-04-04 Thread Nehemiah Dacres
[jian@therock ~]$ echo $LD_LIBRARY_PATH /opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengine/lib/lx26-amd64:/home/jian/.crlibs:/home/jian/.crlibs32 [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -hostfile list ring2 ring2: error while loading shared

Re: [OMPI users] mpi problems,

2011-04-04 Thread Samuel K. Gutierrez
Hi, Try prepending the path to your compiler libraries. Example (bash-like): export LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH -- Samuel K. Gutierrez Los Alamos National Laboratory On Apr 4, 2011, at 1:33 PM, Nehemiah Dacres wrote: altering LD_LIBRARY_PATH alt
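
Spelled out with concrete paths as a bash-style sketch; the Sun Studio path is the one that appears in the LD_LIBRARY_PATH quoted above, and the Open MPI library directory is assumed from the -L flag shown elsewhere in this thread:

    # prepend the compiler and Open MPI library directories:
    export LD_LIBRARY_PATH=/opt/sun/sunstudio12.1/lib:/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64:$LD_LIBRARY_PATH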

Re: [OMPI users] mpi problems,

2011-04-04 Thread Jeff Squyres
I don't know what libfui.so.1 is, but this FAQ entry may answer your question...? http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0 On Apr 4, 2011, at 3:33 PM, Nehemiah Dacres wrote: > altering LD_LIBRARY_PATH alter's the process's path to mpi's libraries, how >

Re: [OMPI users] mpi problems,

2011-04-04 Thread Nehemiah Dacres
Altering LD_LIBRARY_PATH alters the process's path to MPI's libraries; how do I alter its path to compiler libs like libfui.so.1? It needs to find them because it was compiled by a Sun compiler. On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres wrote: > > As Ralph indicated, he'll add the hostname

Re: [OMPI users] mpi problems,

2011-04-04 Thread Nehemiah Dacres
> As Ralph indicated, he'll add the hostname to the error message (but that > might be tricky; that error message is coming from rsh/ssh...). > > In the meantime, you might try (csh style): > > foreach host (`cat list`) >echo $host >ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted > end > > that'

Re: [OMPI users] mpi problems,

2011-04-04 Thread Nehemiah Dacres
That's an excellent suggestion. On Mon, Apr 4, 2011 at 9:45 AM, Jeff Squyres wrote: > As Ralph indicated, he'll add the hostname to the error message (but that > might be tricky; that error message is coming from rsh/ssh...). > > In the meantime, you might try (csh style): > > foreach host (`cat

Re: [OMPI users] mpi problems,

2011-04-04 Thread Ralph Castain
On Apr 4, 2011, at 8:42 AM, Nehemiah Dacres wrote: > you do realize that this is Sun Cluster Tools branch (it is a branch right? > or is it a *port* of openmpi to sun's compilers?) I'm not sure if your > changes made it into sunct 8.2.1 My point was that the error message currently doesn't in

Re: [OMPI users] mpi problems,

2011-04-04 Thread Jeff Squyres
As Ralph indicated, he'll add the hostname to the error message (but that might be tricky; that error message is coming from rsh/ssh...). In the meantime, you might try (csh style): foreach host (`cat list`) echo $host ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted end On Apr 4, 2011, a
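
The same loop reformatted onto separate lines (csh syntax). Note that, as written, the ls runs on the local machine; the ssh variant in the comment is an addition here, not part of the original suggestion, and would check each remote node instead:

    foreach host (`cat list`)
        echo $host
        ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
        # to check the remote node rather than the local one:
        #   ssh $host ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
    end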

Re: [OMPI users] mpi problems,

2011-04-04 Thread Nehemiah Dacres
You do realize that this is the Sun Cluster Tools branch (it is a branch, right? Or is it a *port* of openmpi to Sun's compilers?). I'm not sure if your changes made it into sunct 8.2.1. On Mon, Apr 4, 2011 at 9:34 AM, Ralph Castain wrote: > Guess I can/will add the node name to the error message - sho

Re: [OMPI users] mpi problems,

2011-04-04 Thread Ralph Castain
Guess I can/will add the node name to the error message - should have been there before now. If it is a debug build, you can add "-mca plm_base_verbose 1" to the cmd line and get output tracing the launch and showing you what nodes are having problems. On Apr 4, 2011, at 8:24 AM, Nehemiah Dac
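
Roughly what that looks like when added to the mpirun invocation used elsewhere in this thread (a sketch; per the note above, the extra launch tracing assumes a debug build):

    /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -hostfile list \
        -mca plm_base_verbose 1 ring2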

Re: [OMPI users] mpi problems,

2011-04-04 Thread Nehemiah Dacres
I have installed it via a symlink on all of the nodes; I can run 'tentakel which mpirun' and it finds it. I'll check the library paths, but isn't there a way to find out which nodes are returning the error? On Thu, Mar 31, 2011 at 7:30 AM, Jeff Squyres wrote: > The error message seems to imply t

Re: [OMPI users] mpi problems,

2011-03-31 Thread Jeff Squyres
The error message seems to imply that you don't have OMPI installed on all your nodes (because it didn't find /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted on a remote node). On Mar 30, 2011, at 4:24 PM, Nehemiah Dacres wrote: > I am trying to figure out why my jobs aren't getting distributed and need

Re: [OMPI users] mpi problems,

2011-03-30 Thread David Zhang
As one of the error messages suggests, you need to add the openmpi library directory to your LD_LIBRARY_PATH on all your nodes. On Wed, Mar 30, 2011 at 1:24 PM, Nehemiah Dacres wrote: > I am trying to figure out why my jobs aren't getting distributed and need > some help. I have an install of sun cluster t
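
Two common ways to make that take effect on every node (a sketch, not from the original message; the library path is assumed from the Sun CT install used in this thread, and -x is mpirun's option for exporting an environment variable to the remote nodes):

    # option 1: export the variable at launch time so the remote daemons inherit it
    mpirun -x LD_LIBRARY_PATH -np 4 -hostfile list ring2

    # option 2: set it in a startup file read by non-interactive shells on each node,
    # e.g. ~/.bashrc (the home dir here is NFS-shared, so one edit covers all nodes)
    export LD_LIBRARY_PATH=/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64:$LD_LIBRARY_PATH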

[OMPI users] mpi problems,

2011-03-30 Thread Nehemiah Dacres
I am trying to figure out why my jobs aren't getting distributed and need some help. I have an install of sun cluster tools on Rocks cluster 5.2 (essentially centos4u2). This user's account has its home dir shared via nfs. I am getting some strange errors. Here's an example run: [jian@therock ~]$ /