On 2/6/2014 11:08 AM, Jeff Squyres (jsquyres) wrote:
In addition to what Ralph said (just install OMPI under your $HOME, at least
for testing purposes), here's what we say about version compatibility:
1. OMPI started providing ABI guarantees with v1.3.2. The ABI guarantee we
provide is that the 1.x and 1.(x+1) series will be ABI compatible, where x is
odd. For example, you can compile against 1.5.x and still mpirun with a 1.6.x
installation (assuming you built with shared libraries, yadda yadda yadda).
2. We have never provided any guarantees about compatibility between different
versions of OMPI (even within a 1.x series). Meaning: if you run version a.b.c
on one server, you should run a.b.c on *all* servers in your job. Wire-line
compatibility is NOT guaranteed, and will likely break in either very obnoxious
or very subtle ways. Both are bad.
However, per the just-install-a-copy-in-your-$HOME advice, you can have N
different OMPI installations if you really want to. Just ensure that your PATH
and LD_LIBRARY_PATH point to the *one* that you want to use -- both on the
current server and all servers that you're using in a given job. And that
works fine (I do that all the time -- I have something like 20-30 OMPI installs
under my $HOME, all in various stages of development/debugging; I just update
my PATH / LD_LIBRARY_PATH and I'm good to go).
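For instance, from inside an R session you can confirm which install the
environment actually points at (a minimal sketch; the ~/openmpi-1.7.4 prefix
in the comments is just a placeholder):

    ## Did this session inherit the PATH / LD_LIBRARY_PATH of the one
    ## Open MPI install you meant to use?
    Sys.getenv(c("PATH", "LD_LIBRARY_PATH"))
    ## Expect your chosen prefix first in both, e.g. ~/openmpi-1.7.4/bin
    ## on PATH and ~/openmpi-1.7.4/lib on LD_LIBRARY_PATH.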
Make sense?
Yes. And it seems the recommended one for this purpose is 1.7, not 1.6.
What should happen if I try to transmit something big? At least in my
case it was probably under 4G, which might be some kind of boundary
(though it's a 64-bit system).
Ross
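The boundary is more likely 2^31 - 1 bytes, about 2.1G, than 4G: the count
argument of the C-level MPI send/receive calls is a signed 32-bit int, and
Rmpi ships a serialized object as essentially one such message, so a 3G
payload can overflow internal size arithmetic and surface as errors like the
writev "Bad address" quoted further down. One workaround is to split the
serialized bytes across several smaller messages. A sketch in R, using only
the mpi.send.Robj / mpi.recv.Robj calls already in use in this thread; the
helper names and the 512M chunk size are made up:

    library(Rmpi)

    chunk.bytes <- 512 * 1024^2   # 512M per message, well under 2^31 - 1

    ## Hypothetical helper: send one large R object as several smaller
    ## messages.  The first message carries the chunk count; the rest
    ## carry raw-vector slices of the serialized object.
    mpi.send.bigRobj <- function(obj, dest, tag, comm = 1) {
      raw.obj <- serialize(obj, NULL)
      starts  <- seq(1, length(raw.obj), by = chunk.bytes)
      mpi.send.Robj(length(starts), dest, tag, comm)
      for (s in starts) {
        e <- min(s + chunk.bytes - 1, length(raw.obj))
        mpi.send.Robj(raw.obj[s:e], dest, tag, comm)
      }
    }

    ## Matching hypothetical receiver: collect the chunks, splice the raw
    ## vectors back together, and unserialize the original object.
    mpi.recv.bigRobj <- function(source, tag, comm = 1) {
      n.chunks <- mpi.recv.Robj(source, tag, comm)
      pieces   <- vector("list", n.chunks)
      for (i in seq_len(n.chunks)) {
        pieces[[i]] <- mpi.recv.Robj(source, tag, comm)
      }
      unserialize(do.call(c, pieces))
    }

Each message then stays a few hundred megabytes, at the cost of one extra
serialization pass per chunk.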
On Feb 6, 2014, at 1:23 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
On 2/6/2014 3:24 AM, Jeff Squyres (jsquyres) wrote:
Have you tried upgrading to a newer version of Open MPI? The 1.4.x series is
several generations old. Open MPI 1.7.4 was just released yesterday.
It's on a cluster running Debian squeeze, with perhaps some upgrades to wheezy
coming. However, even wheezy is at 1.4.5 (the next generation is currently at
1.6.5). I don't administer the cluster, and upgrading basic infrastructure
seems somewhat hazardous.
I checked for backports of more recent versions (at backports.debian.org) but
there don't seem to be any for squeeze or wheezy.
Can we mix later and earlier versions of MPI? The documentation at
http://www.open-mpi.org/software/ompi/versions/ seems to indicate that 1.4, 1.6
and 1.7 would all be binary incompatible, though 1.5 and 1.6, or 1.7 and 1.8
would be compatible. However, point 10 of the FAQ
(http://www.open-mpi.org/faq/?category=sysadmin#new-openmpi-version) seems to
say compatibility is broader.
Also, the documents don't seem to address on-the-wire compatibility; that is,
if nodes are on different versions, can they work together reliably?
Thanks.
Ross
On Feb 5, 2014, at 9:58 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
On 1/31/2014 1:08 PM, Ross Boylan wrote:
I am getting the following error, amidst many successful message sends:
[n10][[50048,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:118:mca_btl_tcp_frag_send]
mca_btl_tcp_frag_send: writev error (0x7f6155970038, 578659815)
Bad address(1)
I think I've tracked down the immediate cause: I was sending a very large
object (from R--I assume serialized into a byte stream) that was over 3G. I'm
not sure why it would produce that error, but it doesn't seem that surprising
that something would go wrong.
Ross
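A quick back-of-envelope check in R shows the plausible boundary, the range
of a signed 32-bit int, is already exceeded well before 4G:

    3 * 1024^3             # a 3G payload: 3221225472 bytes
    .Machine$integer.max   # 2147483647 = 2^31 - 1, the largest C int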
Any ideas about what is going on or what I can do to fix it?
I am using the openmpi-bin 1.4.2-4 Debian package on a cluster running Debian
squeeze.
I couldn't find a config.log file; there is
/etc/openmpi/openmpi-mca-params.conf, which is completely commented out.
Invocation is from R 3.0.1 (debian package) with Rmpi 0.6.3 built by me from
source in a local directory. My sends all use mpi.isend.Robj and the receives
use mpi.recv.Robj, both from the Rmpi library.
The jobs were started with rmpilaunch; it and the hosts file are included in
the attachments. TCP connections. rmpilaunch leaves me in an R session on the
master. I invoked the code inside the toplevel() function toward the bottom of
dbox-master.R.
The program source files and other background information are in the attached
file. n10 has the output of ompi_info --all, and n1011 has other info for
both nodes that were active (n10 was master; n11 had some slaves).
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users