One thing commonly done in this situation is for a user to simply download the Open MPI tarball and install it under their own home directory, then build R/Rmpi against that copy. This avoids impacting anyone else on the system and is a low-risk way to see whether the upgrade fixes the problem.
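For example, a user-local build might look roughly like the sketch below (the 1.7.4 version number, the $HOME/opt prefix, and the Rmpi configure flags are assumptions to adapt; check the Open MPI and Rmpi install notes for your versions):

    # Build and install Open MPI under your home directory (no root needed)
    tar xjf openmpi-1.7.4.tar.bz2
    cd openmpi-1.7.4
    ./configure --prefix=$HOME/opt/openmpi-1.7.4
    make -j4 all && make install

    # Put the private install first on the search paths for this shell
    export PATH=$HOME/opt/openmpi-1.7.4/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/opt/openmpi-1.7.4/lib:$LD_LIBRARY_PATH

    # Rebuild Rmpi against the private install (flags per Rmpi's install notes)
    R CMD INSTALL Rmpi_0.6-3.tar.gz \
      --configure-args="--with-Rmpi-include=$HOME/opt/openmpi-1.7.4/include \
                        --with-Rmpi-libpath=$HOME/opt/openmpi-1.7.4/lib \
                        --with-Rmpi-type=OPENMPI"

The PATH/LD_LIBRARY_PATH settings also need to be in effect on the remote nodes (e.g., via your shell startup files), so that the daemons mpirun launches there come from the same installation.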
On Feb 6, 2014, at 10:23 AM, Ross Boylan <r...@biostat.ucsf.edu> wrote:

> On 2/6/2014 3:24 AM, Jeff Squyres (jsquyres) wrote:
>> Have you tried upgrading to a newer version of Open MPI? The 1.4.x series is several generations old. Open MPI 1.7.4 was just released yesterday.
>
> It's on a cluster running Debian squeeze, with perhaps some upgrades to wheezy coming. However, even wheezy is at 1.4.5 (the next generation is currently at 1.6.5). I don't administer the cluster, and upgrading basic infrastructure seems somewhat hazardous.
>
> I checked for backports of a more recent version (at backports.debian.org) but there don't seem to be any for squeeze or wheezy.
>
> Can we mix later and earlier versions of MPI? The documentation at http://www.open-mpi.org/software/ompi/versions/ seems to indicate that 1.4, 1.6 and 1.7 would all be binary incompatible, though 1.5 and 1.6, or 1.7 and 1.8, would be compatible. However, point 10 of the FAQ (http://www.open-mpi.org/faq/?category=sysadmin#new-openmpi-version) seems to say compatibility is broader.
>
> Also, the documents don't seem to address on-the-wire compatibility; that is, if nodes are on different versions, can they work together reliably?
>
> Thanks.
> Ross
>
>> On Feb 5, 2014, at 9:58 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
>>
>>> On 1/31/2014 1:08 PM, Ross Boylan wrote:
>>>> I am getting the following error, amidst many successful message sends:
>>>> [n10][[50048,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:118:mca_btl_tcp_frag_send]
>>>> mca_btl_tcp_frag_send: writev error (0x7f6155970038, 578659815)
>>>> Bad address(1)
>>>>
>>> I think I've tracked down the immediate cause: I was sending a very large object (from R--I assume serialized into a byte stream) that was over 3G. I'm not sure why it would produce that error, but it doesn't seem that surprising that something would go wrong.
>>>
>>> Ross
>>>
>>>> Any ideas about what is going on or what I can do to fix it?
>>>>
>>>> I am using the openmpi-bin 1.4.2-4 Debian package on a cluster running Debian squeeze.
>>>>
>>>> I couldn't find a config.log file; there is /etc/openmpi/openmpi-mca-params.conf, which is completely commented out.
>>>>
>>>> Invocation is from R 3.0.1 (Debian package) with Rmpi 0.6.3 built by me from source in a local directory. My sends all use mpi.isend.Robj and the receives use mpi.recv.Robj, both from the Rmpi library.
>>>>
>>>> The jobs were started with rmpilaunch; it and the hosts file are included in the attachments. TCP connections. rmpilaunch leaves me in an R session on the master. I invoked the code inside the toplevel() function toward the bottom of dbox-master.R.
>>>>
>>>> The program source files and other background information are in the attached file. n10 has the output of ompi_info --all, and n1011 has other info for both nodes that were active (n10 was master; n11 had some slaves).