A common approach in this situation is to download the Open MPI tarball and 
install it in your own home directory, then point R and friends at the updated 
version. This avoids impacting everyone else on the system and is a low-risk 
way to see whether the update fixes the problem.
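
For example, something along these lines (the version, download URL, and 
install prefix below are only illustrative; adjust them for your setup):

  wget http://www.open-mpi.org/software/ompi/v1.7/downloads/openmpi-1.7.4.tar.bz2
  tar xjf openmpi-1.7.4.tar.bz2
  cd openmpi-1.7.4
  ./configure --prefix=$HOME/opt/openmpi-1.7.4
  make -j4 all install

  # put the new install ahead of the system one before launching anything
  export PATH=$HOME/opt/openmpi-1.7.4/bin:$PATH
  export LD_LIBRARY_PATH=$HOME/opt/openmpi-1.7.4/lib:$LD_LIBRARY_PATH

You would then rebuild Rmpi against that installation (e.g. R CMD INSTALL 
with --configure-args pointing at the new include and lib directories) so 
that R loads the new libraries rather than the system 1.4.x ones.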


On Feb 6, 2014, at 10:23 AM, Ross Boylan <r...@biostat.ucsf.edu> wrote:

> On 2/6/2014 3:24 AM, Jeff Squyres (jsquyres) wrote:
>> Have you tried upgrading to a newer version of Open MPI?  The 1.4.x series 
>> is several generations old.  Open MPI 1.7.4 was just released yesterday.
> It's on a cluster running Debian squeeze, with perhaps some upgrades to 
> wheezy coming.  However, even wheezy is at 1.4.5 (the next generation is 
> currently at 1.6.5).  I don't administer the cluster, and upgrading basic 
> infrastructure seems somewhat hazardous.
> 
> I checked for backports of more recent versions (at backports.debian.org), but 
> there don't seem to be any for squeeze or wheezy.
> 
> Can we mix later and earlier versions of MPI?  The documentation at 
> http://www.open-mpi.org/software/ompi/versions/ seems to indicate that 1.4, 
> 1.6 and 1.7 would all be binary incompatible, though 1.5 and 1.6, or 1.7 and 
> 1.8 would be compatible.  However, point 10 of the FAQ 
> (http://www.open-mpi.org/faq/?category=sysadmin#new-openmpi-version) seems to 
> say compatibility is broader.
> 
> Also, the documents don't seem to address on-the-wire compatibility; that is, 
> if nodes are on different versions, can they work together reliably?
> 
> Thanks.
> Ross
>> 
>> 
>> On Feb 5, 2014, at 9:58 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
>> 
>>> On 1/31/2014 1:08 PM, Ross Boylan wrote:
>>>> I am getting the following error, amidst many successful message sends:
>>>> [n10][[50048,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:118:mca_btl_tcp_frag_send]
>>>>  mca_btl_tcp_frag_send: writev error (0x7f6155970038, 578659815)
>>>>         Bad address(1)
>>>> 
>>> I think I've tracked down the immediate cause: I was sending a very large 
>>> object from R (I assume serialized into a byte stream) that was over 3 GB.  
>>> I'm not sure why that would produce this particular error, but it isn't 
>>> too surprising that something would go wrong with a message that large.
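>>> 
>>> If sheer size is the problem (MPI message counts are C ints, so anything 
>>> near or over 2 GB per message is risky, especially on 1.4.x), one 
>>> workaround I'm considering is to serialize the object myself and send it 
>>> in smaller pieces.  A rough, untested sketch (the helper names, chunk 
>>> size, and tag handling are arbitrary):
>>> 
>>>   send.big.Robj <- function(obj, dest, tag, comm = 1,
>>>                             chunk.bytes = 512 * 1024^2) {
>>>     raw.obj <- serialize(obj, connection = NULL)   # one big raw vector
>>>     starts  <- seq(1, length(raw.obj), by = chunk.bytes)
>>>     # tell the receiver how many pieces to expect
>>>     mpi.send.Robj(length(starts), dest = dest, tag = tag, comm = comm)
>>>     for (s in starts) {
>>>       piece <- raw.obj[s:min(s + chunk.bytes - 1, length(raw.obj))]
>>>       mpi.send.Robj(piece, dest = dest, tag = tag, comm = comm)
>>>     }
>>>   }
>>> 
>>>   recv.big.Robj <- function(source, tag, comm = 1) {
>>>     n.pieces <- mpi.recv.Robj(source = source, tag = tag, comm = comm)
>>>     pieces   <- vector("list", n.pieces)
>>>     for (i in seq_len(n.pieces))
>>>       pieces[[i]] <- mpi.recv.Robj(source = source, tag = tag, comm = comm)
>>>     unserialize(do.call(c, pieces))
>>>   }
>>> 
>>> Each piece still gets serialized again inside mpi.send.Robj, and this uses 
>>> blocking sends rather than the mpi.isend.Robj calls I have now, but every 
>>> individual message stays well under 2 GB.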
>>> 
>>> Ross
>>>> Any ideas about what is going on or what I can do to fix it?
>>>> 
>>>> I am using the openmpi-bin 1.4.2-4 Debian package on a cluster running 
>>>> Debian squeeze.
>>>> 
>>>> I couldn't find a config.log file; there is 
>>>> /etc/openmpi/openmpi-mca-params.conf, which is completely commented out.
>>>> 
>>>> Invocation is from R 3.0.1 (debian package) with Rmpi 0.6.3 built by me 
>>>> from source in a local directory. My sends all use mpi.isend.Robj and the 
>>>> receives use mpi.recv.Robj, both from the Rmpi library.
>>>> 
>>>> The jobs were started with rmpilaunch; it and the hosts file are included 
>>>> in the attachments.  The connections are over TCP.  rmpilaunch leaves me 
>>>> in an R session on the master.  I invoked the code inside the toplevel() 
>>>> function toward the bottom of dbox-master.R.
>>>> 
>>>> The program source files and other background information are in the 
>>>> attached file.  n10 has the output of ompi_info --all, and n1011 has 
>>>> other info for both nodes that were active (n10 was master; n11 had some 
>>>> slaves).
>>>> 
>>>> 
>> 
> 
