Hooray!! Great to hear - I was running out of ideas :-)
On Dec 19, 2012, at 2:01 PM, Daniel Davidson wrote:
I figured this out.
ssh was working, but scp was not due to an mtu mismatch between the
systems. Adding MTU=1500 to my
/etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem.
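For reference, the whole change is one line in the interface config (a minimal
sketch, assuming a stock RHEL/CentOS-style ifcfg file; everything else in the
file stays as it was):

# /etc/sysconfig/network-scripts/ifcfg-eth2 (excerpt)
DEVICE=eth2
MTU=1500

followed by ifdown eth2 && ifup eth2 to apply it, and ip link show eth2 to
confirm the new MTU.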
Dan
On 12/17/2012 04:12 PM, Daniel Davidson wrote:
Yes, it does.
Dan
[root@compute-2-1 ~]# ssh compute-2-0
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
[root@compute-2-0 ~]# ssh co
Daniel,
Does passwordless ssh work? You need to make sure that it does.
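A quick way to verify (and set up) key-based ssh in both directions (a sketch,
assuming root's keys are in play, as in the prompts above):

[root@compute-2-1 ~]# ssh-keygen -t rsa                # accept defaults, empty passphrase
[root@compute-2-1 ~]# ssh-copy-id root@compute-2-0
[root@compute-2-1 ~]# ssh compute-2-0 hostname         # should print compute-2-0 with no password prompt

and then the same from compute-2-0 back to compute-2-1.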
Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:
I would also add that scp seems to be creating the file in the /tmp
directory of compute-2-0, and that /var/log/secure is showing ssh
connections being accepted. Is there anything in ssh that can limit
connections that I need to look out for? My guess is that it is part of
the client prefs an
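For what it's worth, two sshd settings that can throttle incoming connections
are MaxStartups and MaxSessions (a quick check, assuming the stock
/etc/ssh/sshd_config location):

[root@compute-2-0 ~]# grep -Ei 'maxstartups|maxsessions' /etc/ssh/sshd_config

MaxStartups caps concurrent unauthenticated connections; MaxSessions caps
sessions per connection. Both are worth ruling out on both nodes.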
After a very long time (15 minutes or so), I finally received the following
in addition to what I just sent earlier:
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-0.local:246
Hmmm...and that is ALL the output? If so, then it never succeeded in sending a
message back, which leads one to suspect some kind of firewall in the way.
Looking at the ssh line, we are going to attempt to send a message from node
2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work?
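One quick way to rule a firewall in or out (a sketch, assuming a stock
CentOS/Rocks iptables setup):

[root@compute-2-1 ~]# iptables -L -n          # any REJECT/DROP rules? check both nodes
[root@compute-2-1 ~]# service iptables stop   # temporarily, just for a test run

and then retry the mpirun; if it works, the rules need to allow the ORTE
daemons' TCP traffic between the two nodes.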
These nodes have not yet been locked down to prevent jobs from being
launched from the back end, at least not on purpose. The added
logging returns the information below:
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 10 --leave-session-attached
?? That was all the output? If so, then something is indeed quite wrong as it
didn't even attempt to launch the job.
Try adding -mca plm_base_verbose 5 to the cmd line.
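i.e. something along the lines of (a sketch, reusing the command from earlier
in the thread):

/home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -np 10 \
    --leave-session-attached -mca plm_base_verbose 5 hostname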
I was assuming you were using ssh as the launcher, but I wonder if you are in
some managed environment? If so, then it could b
This looks to be having issues as well, and I cannot get any number of
processors to give me a different result with the new version.
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca
odls_base_verbose 5 hostname
[
I will give this a try, but wouldn't that be an issue as well if the
process was run on the head node or another node? So long as the mpi
job is not started on either of these two nodes, it works fine.
Dan
On 12/14/2012 11:46 PM, Ralph Castain wrote:
It must be making contact or ORTE wouldn't be attempting to launch your
application's procs. Looks more like it never received the launch command.
Looking at the code, I suspect you're getting caught in a race condition that
causes the message to get "stuck".
Just to see if that's the case, you
Thank you for the help so far. Here is the information that the
debugging gives me. It looks like the daemon on the non-local node
never makes contact. If I step NP back by two, though, it does.
Dan
[root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -
Sorry - I forgot that you built from a tarball, and so debug isn't enabled by
default. You need to configure --enable-debug.
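Roughly (a sketch, using the install prefix from your command lines; adjust as
needed):

./configure --prefix=/home/apps/openmpi-1.6.3 --enable-debug
make all install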
On Dec 14, 2012, at 1:52 PM, Daniel Davidson wrote:
Oddly enough, adding this debugging info lowered the number of
processes that can be used down to 42 from 46. When I run the MPI
job, it fails, giving only the information that follows:
[root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 44 --leave-se
It wouldn't be ssh - in both cases, only one ssh is being done to each node (to
start the local daemon). The only difference is the number of fork/exec's being
done on each node, and the number of file descriptors being opened to support
those fork/exec's.
It certainly looks like your limits ar
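The per-process limits are worth checking on both nodes (a sketch; the
open-file and process counts are the usual suspects, and
/etc/security/limits.conf is where to raise them):

[root@compute-2-0 ~]# ulimit -n    # max open file descriptors
[root@compute-2-0 ~]# ulimit -u    # max user processes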
I have had to cobble together two machines in our Rocks cluster without
using the standard installation: they have an EFI-only BIOS, which Rocks
doesn't like, so this was the only workaround.
Everything works great now, except for one thing. MPI jobs (Open MPI or
MPICH) fail when started fr
Oh thank you! That might work.
On Thu, Apr 7, 2011 at 5:31 AM, Terry Dontje wrote:
Nehemiah,
I took a look at an old version of an hpl Makefile I have. I think what
you really want to do is not set the MP* variables to anything and, near
the end of the Makefile, set CC and LINKER to mpicc. You may need to
also change the CFLAGS and LINKERFLAGS variables to match which
compiler you use.
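Something along these lines in the Make.<arch> file (a sketch from memory, not
the exact file; variable names follow HPL's template):

# MPI paths left empty -- the mpicc wrapper supplies includes and libs
MPdir  =
MPinc  =
MPlib  =
# ...near the end of the file...
CC     = mpicc
LINKER = mpicc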
Sigh...look at the output of mpicc --showme. It tells you where the OMPI libs
were installed:
-I/opt/SUNWhpc/HPC8.2.1c/sun/include/64
-I/opt/SUNWhpc/HPC8.2.1c/sun/include/64/openmpi -R/opt/mx/lib/lib64
-R/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64 -L/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64
-lmpi -lopen-rt
[jian@therock lib]$ ls lib64/*.a
lib64/libotf.a lib64/libvt.fmpi.a lib64/libvt.omp.a
lib64/libvt.a lib64/libvt.mpi.a lib64/libvt.ompi.a
Last time I linked one of those files it told me they were in the wrong
format. These are in archive format; what format should they be in?
On Wed, Apr 6,
Look at your output from mpicc --showme. It indicates that the OMPI libs were
put in the lib64 directory, not lib.
On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote:
I am also trying to get netlib's hpl to run via Sun Cluster Tools, so I am
trying to compile it and am having trouble. Which is the proper mpi library
to give?
Naturally this isn't going to work:
MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/
MPinc= -I$(MPdir)/include
MPlib= $(MPdir)/li
Something looks fishy about your numbers. The first two sets of numbers
look the same, and the last set does look better for the most part. Your
mpirun command line looks weird to me with the "-mca
orte_base_help_aggregate btl,openib,self," bit; did something get chopped
off with the text copy? You
Nehemiah Dacres wrote:
Also, I'm not sure if I'm reading the results right. According to the last
run, did using the Sun compilers (update 1) result in higher performance
with sunct?
On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres wrote:
Some tests I did. I hope this isn't an abuse of the list; please tell me if
it is, but thanks to all those who helped me.
This goes to say that the Sun MPI works with programs not compiled with
Sun's compilers.
This first test was run as a base case to see if MPI works. The second run
is to see
Thanks all, I realized that the Sun compilers weren't installed on all the
nodes. It seems to be working; soon I will test the MCA parameters for IB.
On Mon, Apr 4, 2011 at 7:35 PM, Terry Dontje wrote:
libfui.so is a library that is part of the Solaris Studio Fortran tools. It
should be located under lib wherever your Solaris Studio compilers are
installed. So one question is whether you actually have Studio Fortran
installed on all your nodes or not?
--td
On 04/04/2011 04:02 PM, Ralph C
What does 'ldd ring2' show? How was it compiled?
--
Samuel K. Gutierrez
Los Alamos National Laboratory
Well, where is libfui located? Is that location in your ld path? Is the lib
present on all nodes in your hostfile?
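e.g. (a sketch; the Studio prefix here is just the one that appears in your
LD_LIBRARY_PATH):

[jian@therock ~]$ find /opt/sun/sunstudio12.1 -name 'libfui.so*'
[jian@therock ~]$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep sunstudio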
On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote:
[jian@therock ~]$ echo $LD_LIBRARY_PATH
/opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengine/lib/lx26-amd64:/home/jian/.crlibs:/home/jian/.crlibs32
[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -hostfile
list ring2
ring2: error while loading shared
Hi,
Try prepending the path to your compiler libraries.
Example (bash-like):
export LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH
--
Samuel K. Gutierrez
Los Alamos National Laboratory
I don't know what libfui.so.1 is, but this FAQ entry may answer your
question...?
http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0
On Apr 4, 2011, at 3:33 PM, Nehemiah Dacres wrote:
Altering LD_LIBRARY_PATH alters the process's path to MPI's libraries; how
do I alter its path to compiler libs like libfui.so.1? It needs to find them
because it was compiled by a Sun compiler.
On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres wrote:
That's an excellent suggestion.
On Mon, Apr 4, 2011 at 9:45 AM, Jeff Squyres wrote:
On Apr 4, 2011, at 8:42 AM, Nehemiah Dacres wrote:
> you do realize that this is Sun Cluster Tools branch (it is a branch right?
> or is it a *port* of openmpi to sun's compilers?) I'm not sure if your
> changes made it into sunct 8.2.1
My point was that the error message currently doesn't in
As Ralph indicated, he'll add the hostname to the error message (but that might
be tricky; that error message is coming from rsh/ssh...).
In the meantime, you might try (csh style):
foreach host (`cat list`)
echo $host
ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
end
On Apr 4, 2011, a
You do realize that this is the Sun Cluster Tools branch (it is a branch,
right? Or is it a *port* of Open MPI to Sun's compilers?). I'm not sure if
your changes made it into sunct 8.2.1.
On Mon, Apr 4, 2011 at 9:34 AM, Ralph Castain wrote:
Guess I can/will add the node name to the error message - should have been
there before now.
If it is a debug build, you can add "-mca plm_base_verbose 1" to the cmd line
and get output tracing the launch and showing you what nodes are having
problems.
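i.e. roughly (a sketch, reusing the command line from your example run):

[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -hostfile list \
    -mca plm_base_verbose 1 ring2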
On Apr 4, 2011, at 8:24 AM, Nehemiah Dac
I have installed it via a symlink on all of the nodes; I can run 'tentakel
which mpirun' and it finds it. I'll check the library paths, but isn't there
a way to find out which nodes are returning the error?
On Thu, Mar 31, 2011 at 7:30 AM, Jeff Squyres wrote:
The error message seems to imply that you don't have OMPI installed on all your
nodes (because it didn't find /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted on a remote
node).
As one of the error messages suggests, you need to add the Open MPI library
directory to LD_LIBRARY_PATH on all your nodes.
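For example (a sketch, using the lib64 path that mpicc --showme reports earlier
in the thread; put it in a shell startup file that is read on every node):

export LD_LIBRARY_PATH=/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64:$LD_LIBRARY_PATH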
On Wed, Mar 30, 2011 at 1:24 PM, Nehemiah Dacres wrote:
I am trying to figure out why my jobs aren't getting distributed and need
some help. I have an install of Sun Cluster Tools on Rockscluster 5.2
(essentially CentOS 4u2). This user's account has its home dir shared via
NFS. I am getting some strange errors. Here's an example run:
[jian@therock ~]$ /