On Feb 6, 2014, at 11:32 AM, Ross Boylan <r...@biostat.ucsf.edu> wrote:

> On 2/6/2014 11:08 AM, Jeff Squyres (jsquyres) wrote:
>> In addition to what Ralph said (just install OMPI under your $HOME, at
>> least for testing purposes), here's what we say about version
>> compatibility:
>>
>> 1. OMPI started providing ABI guarantees with v1.3.2. The ABI guarantee
>> we provide is that a 1.x and 1.(x+1) series will be ABI compatible, where
>> x is odd. For example, you can compile against 1.5.x and still mpirun
>> with a 1.6.x installation (assuming you built with shared libraries,
>> yadda yadda yadda).
>>
>> 2. We have never provided any guarantees about compatibility between
>> different versions of OMPI (even within a 1.x series). Meaning: if you
>> run version a.b.c on one server, you should run a.b.c on *all* servers in
>> your job. Wire-line compatibility is NOT guaranteed, and will likely
>> break in either very obnoxious or very subtle ways. Both are bad.
>>
>> However, per the just-install-a-copy-in-your-$HOME advice, you can have N
>> different OMPI installations if you really want to. Just ensure that your
>> PATH and LD_LIBRARY_PATH point to the *one* that you want to use -- both
>> on the current server and all servers that you're using in a given job.
>> And that works fine (I do that all the time -- I have something like
>> 20-30 OMPI installs under my $HOME, all in various stages of
>> development/debugging; I just update my PATH / LD_LIBRARY_PATH and I'm
>> good to go).
>>
>> Make sense?
>
> Yes. And it seems the recommended one for this purpose is 1.7, not 1.6.
>
> What should happen if I try to transmit something big? At least in my case
> it was probably under 4G, which might be some kind of boundary (though
> it's a 64-bit system).

The key is that MPI defines the count arguments in its APIs as "int". Many
64-bit systems still make "int" a 32-bit integer, which means a single
message is limited to a count of 2^31 - 1 elements (roughly 2 GiB when
sending bytes). Outside of that constraint, there shouldn't be an issue
other than memory footprint limitations; one way to stay under the per-send
limit is to split a large buffer into chunks, as sketched at the end of this
message.

>
> Ross
>>
>>
>> On Feb 6, 2014, at 1:23 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
>>
>>> On 2/6/2014 3:24 AM, Jeff Squyres (jsquyres) wrote:
>>>> Have you tried upgrading to a newer version of Open MPI? The 1.4.x
>>>> series is several generations old. Open MPI 1.7.4 was just released
>>>> yesterday.
>>> It's on a cluster running Debian squeeze, with perhaps some upgrades to
>>> wheezy coming. However, even wheezy is at 1.4.5 (the next generation is
>>> currently at 1.6.5). I don't administer the cluster, and upgrading basic
>>> infrastructure seems somewhat hazardous.
>>>
>>> I checked for backports of more recent versions (at backports.debian.org),
>>> but there don't seem to be any for squeeze or wheezy.
>>>
>>> Can we mix later and earlier versions of MPI? The documentation at
>>> http://www.open-mpi.org/software/ompi/versions/ seems to indicate that
>>> 1.4, 1.6 and 1.7 would all be binary incompatible, though 1.5 and 1.6,
>>> or 1.7 and 1.8, would be compatible. However, point 10 of the FAQ
>>> (http://www.open-mpi.org/faq/?category=sysadmin#new-openmpi-version)
>>> seems to say compatibility is broader.
>>>
>>> Also, the documents don't seem to address on-the-wire compatibility;
>>> that is, if nodes are on different versions, can they work together
>>> reliably?
>>>
>>> Thanks.
>>> Ross
>>>>
>>>> On Feb 5, 2014, at 9:58 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
>>>>
>>>>> On 1/31/2014 1:08 PM, Ross Boylan wrote:
>>>>>> I am getting the following error, amidst many successful message sends:
>>>>>> [n10][[50048,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:118:mca_btl_tcp_frag_send]
>>>>>> mca_btl_tcp_frag_send: writev error (0x7f6155970038, 578659815)
>>>>>> Bad address(1)
>>>>>>
>>>>> I think I've tracked down the immediate cause: I was sending a very
>>>>> large object (from R -- I assume serialized into a byte stream) that
>>>>> was over 3G. I'm not sure why it would produce that error, but it
>>>>> doesn't seem that surprising that something would go wrong.
>>>>>
>>>>> Ross
>>>>>> Any ideas about what is going on or what I can do to fix it?
>>>>>>
>>>>>> I am using the openmpi-bin 1.4.2-4 Debian package on a cluster running
>>>>>> Debian squeeze.
>>>>>>
>>>>>> I couldn't find a config.log file; there is
>>>>>> /etc/openmpi/openmpi-mca-params.conf, which is completely commented out.
>>>>>>
>>>>>> Invocation is from R 3.0.1 (Debian package) with Rmpi 0.6.3 built by me
>>>>>> from source in a local directory. My sends all use mpi.isend.Robj and
>>>>>> the receives use mpi.recv.Robj, both from the Rmpi library.
>>>>>>
>>>>>> The jobs were started with rmpilaunch; it and the hosts file are
>>>>>> included in the attachments. TCP connections. rmpilaunch leaves me in
>>>>>> an R session on the master. I invoked the code inside the toplevel()
>>>>>> function toward the bottom of dbox-master.R.
>>>>>>
>>>>>> The program source files and other background information are in the
>>>>>> attached file. n10 has the output of ompi_info --all, and n1011 has
>>>>>> other info for both nodes that were active (n10 was master; n11 had
>>>>>> some slaves).
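
A minimal sketch of the chunking workaround mentioned above, in plain MPI C.
The helper names (send_large/recv_large), the tag handling, and the use of
MPI_BYTE are illustrative assumptions for this list post, not anything Rmpi
does internally; the point is only that each individual MPI_Send/MPI_Recv
keeps its count within a 32-bit int.

    /* Hedged sketch, not from the original thread: hypothetical helpers that
     * walk a large byte buffer in pieces whose counts fit in a 32-bit int. */
    #include <limits.h>
    #include <stddef.h>
    #include <mpi.h>

    #define CHUNK ((size_t) INT_MAX)  /* largest count one send/recv can take */

    /* Send 'len' bytes from 'buf' to rank 'dest', one piece at a time. */
    static void send_large(const char *buf, size_t len, int dest, int tag,
                           MPI_Comm comm)
    {
        size_t offset = 0;
        while (offset < len) {
            size_t n = len - offset;
            if (n > CHUNK)
                n = CHUNK;
            /* each count fits in an int, so this stays inside the MPI limit */
            MPI_Send((void *) (buf + offset), (int) n, MPI_BYTE, dest, tag, comm);
            offset += n;
        }
    }

    /* Matching receive; the caller must already know the total length 'len'. */
    static void recv_large(char *buf, size_t len, int src, int tag,
                           MPI_Comm comm)
    {
        size_t offset = 0;
        while (offset < len) {
            size_t n = len - offset;
            if (n > CHUNK)
                n = CHUNK;
            MPI_Recv(buf + offset, (int) n, MPI_BYTE, src, tag, comm,
                     MPI_STATUS_IGNORE);
            offset += n;
        }
    }

In practice the sender would first transmit the total length (for example as
a single MPI_UNSIGNED_LONG) so the receiver can allocate its buffer and loop
to the same count. Ross's ~3 GB serialized R object is over 2^31 bytes, so a
single send of it cannot be described by a 32-bit count, which is at least
consistent with (though not proof of the cause of) the writev "Bad address"
failure he reported on Open MPI 1.4 over TCP.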