We saw the same problem with compilation; the workaround for us was
configuring without VT (see ./configure --help for the option to disable it).
I hope the VT folks will fix it at some point.

Lenny.

On Mon, Feb 23, 2009 at 11:48 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> It would be interesting to see what happens with the 1.3 build.
>
> It's hard to interpret the output of your user's test program without
> knowing exactly what that printf means...
>
>
> On Feb 23, 2009, at 4:44 PM, Jim Kusznir wrote:
>
>> I haven't had time to do the openmpi build from the nightly yet, but
>> my user has run some more tests and now has a simple program and
>> algorithm to "break" openmpi.  His notes:
>>
>> Hey, just FYI, I can reproduce the error readily in a simple test case.
>> My "way to break MPI" is as follows: the master proc runs MPI_Send 1000
>> times to each child, then waits for an "I got it" ack from each child.
>> Each child receives 1000 numbers from the master, then sends "I got
>> it" to the master.
>> Running this on 25 nodes causes it to break about 60% of the time;
>> interestingly, it usually breaks on the same process number each time.
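>>
>> A minimal sketch of that pattern (illustrative only; this is not my
>> actual test code, and the tags, counts, and output format are made up):
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> #define NMSG 1000
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size, i, r, val, ack = 1;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     if (rank == 0) {
>>         /* Master: send NMSG values to each child... */
>>         for (r = 1; r < size; ++r)
>>             for (i = 0; i < NMSG; ++i)
>>                 MPI_Send(&i, 1, MPI_INT, r, 0, MPI_COMM_WORLD);
>>         /* ...then wait for one "I got it" ack per child. */
>>         for (r = 1; r < size; ++r)
>>             MPI_Recv(&ack, 1, MPI_INT, r, 1, MPI_COMM_WORLD,
>>                      MPI_STATUS_IGNORE);
>>     } else {
>>         /* Child: receive NMSG values from the master... */
>>         for (i = 0; i < NMSG; ++i) {
>>             MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>>                      MPI_STATUS_IGNORE);
>>             printf("rank: %d received %d\n", rank, val);
>>         }
>>         /* ...then send the ack back to the master. */
>>         MPI_Send(&ack, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }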
>>
>> Ah, it looks like if I let it sit for about 5 minutes, it will
>> sometimes work. From my log:
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 816
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 817
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 818
>> rank: 23 Mon Feb 23 13:33:08 2009 recieved 819
>> rank: 23 Mon Feb 23 13:33:08 2009 recieved 820
>>
>> Any thoughts on this problem?
>> (This is the only reason I'm currently working on upgrading Open MPI.)
>>
>> --Jim
>>
>> On Fri, Feb 20, 2009 at 1:59 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>
>>> There won't be an official SRPM until 1.3.1 is released.
>>>
>>> But to test if 1.3.1 is on-track to deliver a proper solution to you, can
>>> you try a nightly tarball, perhaps in conjunction with our "buildrpm.sh"
>>> script?
>>>
>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/buildrpm.sh
>>>
>>> It should build a trivial SRPM for you from the tarball.  You'll likely
>>> need to get the specfile, too, and put it in the same dir as buildrpm.sh.
>>> The specfile is in the same SVN directory:
>>>
>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/openmpi.spec
>>>
>>> On Feb 20, 2009, at 3:51 PM, Jim Kusznir wrote:
>>>
>>>> As long as I can still build the rpm for it and install it via rpm.
>>>> I'm running it on a ROCKS cluster, so it needs to be an RPM to get
>>>> pushed out to the compute nodes.
>>>>
>>>> --Jim
>>>>
>>>> On Fri, Feb 20, 2009 at 11:30 AM, Jeff Squyres <jsquy...@cisco.com>
>>>> wrote:
>>>>>
>>>>> On Feb 20, 2009, at 2:20 PM, Jim Kusznir wrote:
>>>>>
>>>>>> I just went to www.open-mpi.org, went to download, then source rpm.
>>>>>> Looks like it was actually 1.3-1.  Here's the src.rpm that I pulled
>>>>>> in:
>>>>>>
>>>>>> http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3-1.src.rpm
>>>>>
>>>>> Ah, gotcha.  Yes, that's 1.3.0, SRPM version 1.  We didn't make up this
>>>>> nomenclature.  :-(
>>>>>
>>>>>> The reason for this upgrade is that a user seems to have found a bug,
>>>>>> possibly in the Open MPI code, that occasionally results in an
>>>>>> MPI_Send() message getting lost.  He's managed to reproduce it multiple
>>>>>> times, and we can't find anything in his code that could cause it.
>>>>>> He's got logs of the MPI_Send() going out, but the matching MPI_Recv()
>>>>>> never receives anything, which kills his code.  We're currently running
>>>>>> 1.2.8 with OFED support (we haven't tried turning off OFED, etc. yet).
>>>>>
>>>>> OK.  1.3.x is much mo' betta' than 1.2 in many ways.  We could probably
>>>>> help track down the problem, but if you're willing to upgrade to 1.3.x,
>>>>> it'll hopefully just make the problem go away.
>>>>>
>>>>> Can you try a 1.3.1 nightly tarball?
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>>>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
